<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sandipan bhaumik</title>
    <description>The latest articles on DEV Community by sandipan bhaumik (@sandipan_bhaumik_effe80b2).</description>
    <link>https://dev.to/sandipan_bhaumik_effe80b2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3684391%2F26bb4c39-9465-49a7-b607-0a93da2d4311.jpg</url>
      <title>DEV Community: sandipan bhaumik</title>
      <link>https://dev.to/sandipan_bhaumik_effe80b2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sandipan_bhaumik_effe80b2"/>
    <language>en</language>
    <item>
      <title>Monitor RAG Data Source Quality</title>
      <dc:creator>sandipan bhaumik</dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:19:31 +0000</pubDate>
      <link>https://dev.to/sandipan_bhaumik_effe80b2/monitor-rag-data-source-quality-before-your-ai-hallucinates-1n7e</link>
      <guid>https://dev.to/sandipan_bhaumik_effe80b2/monitor-rag-data-source-quality-before-your-ai-hallucinates-1n7e</guid>
      <description>&lt;p&gt;RAG data source monitoring is a critical gap I've seen in enterprise AI systems that few teams address until production failures force the issue. This is about maintaining the reliability of what you retrieve, not just what you generate. It's not the only approach to RAG quality, but it's one that works when web sources are mission-critical and silent degradation isn't acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrievals degrade silently
&lt;/h2&gt;

&lt;p&gt;Your enterprise RAG system answered a compliance question with outdated guidance. The legal team caught it during review. Three hours before a regulatory filing deadline.&lt;/p&gt;

&lt;p&gt;The error logs show nothing unusual: &lt;code&gt;Retrieved 3 sources, generated response, confidence: 0.94&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Your retrieval worked. Your LLM worked. The system architecture performed exactly as designed.&lt;/p&gt;

&lt;p&gt;So what broke?&lt;/p&gt;

&lt;p&gt;Investigation reveals: One of your primary data sources - an FDA guidance document your RAG system has cited for six months - was updated three weeks ago. The page structure changed. Your retrieval still fetched the URL successfully, but now it's pulling from an outdated archive version the site automatically redirects to.&lt;/p&gt;

&lt;p&gt;Your RAG system has been confidently generating responses based on deprecated regulatory guidance for 21 days. Nobody knew.&lt;/p&gt;

&lt;p&gt;Cost: Near-miss on regulatory compliance. Trust in the AI system is damaged. Emergency audit of all RAG sources initiated.&lt;/p&gt;

&lt;p&gt;This is the hidden liability in production RAG systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RAG data quality problem
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation changed how enterprises build AI systems. Instead of fine-tuning models with static knowledge, we retrieve fresh context from authoritative sources and augment the LLM's response.&lt;/p&gt;

&lt;p&gt;RAG patterns promise always-current information, cited sources, and fewer hallucinations. In reality, however, a RAG system is only as reliable as its sources. And as you know, web sources decay.&lt;/p&gt;

&lt;p&gt;Let's have a look at what enterprise RAG systems usually depend on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regulatory guidance - FDA guidelines, SEC filings, compliance documents&lt;/li&gt;
&lt;li&gt;Technical documentation - API specs, integration guides, security advisories&lt;/li&gt;
&lt;li&gt;Medical literature - Clinical studies, treatment protocols, drug interactions&lt;/li&gt;
&lt;li&gt;Legal precedents - Case law, statute changes, regulatory updates&lt;/li&gt;
&lt;li&gt;Financial data - Market analyses, economic indicators, company filings&lt;/li&gt;
&lt;li&gt;Internal knowledge bases - Confluence pages, SharePoint docs, wiki content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What happens to these sources over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Links break - Pages move, sites restructure, domains expire&lt;/li&gt;
&lt;li&gt;Content changes - Updates happen without announcement&lt;/li&gt;
&lt;li&gt;Paywalls appear - Previously free content requires authentication&lt;/li&gt;
&lt;li&gt;Sites go offline - Vendors sunset products, projects get archived&lt;/li&gt;
&lt;li&gt;Structure shifts - Page layout changes break content extraction&lt;/li&gt;
&lt;li&gt;Information becomes stale - Content exists but is outdated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is, your RAG system doesn't know about these changes. It retrieves what it can, generates a response, and returns high confidence. The degradation is invisible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why traditional monitoring misses this
&lt;/h2&gt;

&lt;p&gt;A traditional observability stack tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM API latency and errors&lt;/li&gt;
&lt;li&gt;Retrieval success rate (did we fetch something?)&lt;/li&gt;
&lt;li&gt;Vector database query performance&lt;/li&gt;
&lt;li&gt;End-to-end response times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it doesn't track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the retrieved content actually match what we expected?&lt;/li&gt;
&lt;li&gt;Has the source's information changed significantly?&lt;/li&gt;
&lt;li&gt;Is this source still authoritative and current?&lt;/li&gt;
&lt;li&gt;Are we retrieving from the intended page or a redirect?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap: Most RAG monitoring focuses on system performance (speed, uptime, errors) but not data quality (accuracy, freshness, relevance).&lt;/p&gt;

&lt;p&gt;You find out about source degradation when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users report incorrect responses&lt;/li&gt;
&lt;li&gt;Internal subject matter experts notice outdated information&lt;/li&gt;
&lt;li&gt;Regulatory review catches compliance issues&lt;/li&gt;
&lt;li&gt;An audit compares RAG outputs to current sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By then, your system has been generating unreliable responses for days or weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to build a RAG source data quality monitoring system
&lt;/h2&gt;

&lt;p&gt;We will build an automated RAG data source quality monitor that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates source accessibility - Is the URL still reachable? Is it redirecting?&lt;/li&gt;
&lt;li&gt;Detects content drift - Has the page content changed significantly?&lt;/li&gt;
&lt;li&gt;Tracks content freshness - When was this source last updated?&lt;/li&gt;
&lt;li&gt;Scores source reliability - Which sources are stable vs. degrading?&lt;/li&gt;
&lt;li&gt;Alerts on degradation - Notify teams before RAG quality suffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system runs continuously, checking your defined sources every 6-24 hours, and alerts you to quality issues before they cascade into hallucinations or compliance problems.&lt;/p&gt;
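&lt;p&gt;As a sketch of the core check loop, here is what one validation pass might look like. This is illustrative code, not the repo's implementation: &lt;code&gt;SourceRecord&lt;/code&gt;, &lt;code&gt;check_source&lt;/code&gt;, and the penalty values are assumptions, and &lt;code&gt;fetch_content&lt;/code&gt; stands in for whatever retrieval client you use.&lt;/p&gt;

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class SourceRecord:
    """Minimal per-source state the monitor keeps between runs."""
    url: str
    last_hash: str = ""
    last_checked: float = 0.0
    quality_score: int = 100

def check_source(record: SourceRecord, fetch_content, now=None) -> list:
    """Run one validation pass for a source and return any alerts.

    `fetch_content(url)` is injected so the logic stays testable; in
    production it would call your retrieval layer or a SERP API client.
    """
    alerts = []
    record.last_checked = now if now is not None else time.time()
    content = fetch_content(record.url)
    if content is None:
        # Source unreachable: heavy penalty, immediate alert.
        record.quality_score = max(0, record.quality_score - 40)
        alerts.append(f"UNREACHABLE: {record.url}")
    else:
        digest = hashlib.sha256(content.encode()).hexdigest()
        if record.last_hash and digest != record.last_hash:
            # Content drift: flag before the RAG system serves stale answers.
            record.quality_score = max(0, record.quality_score - 30)
            alerts.append(f"CONTENT_CHANGED: {record.url}")
        record.last_hash = digest
    return alerts
```

&lt;p&gt;A scheduler (cron, Airflow, or a plain loop) would call this for every configured source on its 6-24 hour cadence and forward any alerts.&lt;/p&gt;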

&lt;p&gt;Sequence flow overview:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzlvgjr69cw6sflv16rs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzlvgjr69cw6sflv16rs.png" alt="Sequence Diagram - RAG Data Source Quality Monitoring" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this work: &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt; addresses this problem by using a search engine's real-time, comprehensive index to monitor and validate the health of your RAG's external sources, which is more robust and scalable than fetching and parsing each page yourself.&lt;/p&gt;

&lt;p&gt;Here is a breakdown of how it works technically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technical Function&lt;/th&gt;
&lt;th&gt;How it Addresses the Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-time Search Index&lt;/td&gt;
&lt;td&gt;The API leverages a search engine's up-to-date crawl data, meaning changes to a regulatory page (like an FDA guidance update) are reflected within hours of the search engine finding them.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured JSON Results&lt;/td&gt;
&lt;td&gt;It provides clean, structured JSON metadata about the source instead of raw HTML. This eliminates the need for you to perform complex and brittle HTML parsing, which often breaks when a website's structure changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification of Indexing &amp;amp; Accessibility&lt;/td&gt;
&lt;td&gt;It searches the web in real-time to verify a source is still indexed and accessible, instantly detecting issues like broken links, unannounced redirects, or pages going offline.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure Handling&lt;/td&gt;
&lt;td&gt;It manages the complex infrastructure of web scraping, including proxies, rate limiting, and CAPTCHA solving. This allows a single, lightweight API call to validate multiple sources quickly, rather than you having to build a massive, complex fetching system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content Change Detection&lt;/td&gt;
&lt;td&gt;By tracking the search metadata, it can detect a "Significant content change detected" event, which is what triggers the quality score drop (e.g., from 92/100 to 45/100 in Scenario 2), alerting you to content drift before it impacts RAG output.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
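&lt;p&gt;The "Structured JSON Results" row is the key mechanism: instead of parsing HTML, you interpret search metadata. A hedged sketch, assuming a generic result shape of &lt;code&gt;{"link", "title", "snippet"}&lt;/code&gt; dictionaries rather than Bright Data's exact response schema (check their documentation for the real contract):&lt;/p&gt;

```python
def validate_from_serp(expected_url: str, serp_results: list) -> dict:
    """Decide from SERP-style results whether a monitored source is still
    indexed at its expected URL.

    serp_results: list of {"link": ..., "title": ..., "snippet": ...} dicts,
    an assumed generic shape, not a specific vendor schema.
    """
    for rank, result in enumerate(serp_results, start=1):
        if result.get("link", "").rstrip("/") == expected_url.rstrip("/"):
            # Still indexed; the snippet can be diffed against the last run
            # to detect content drift without fetching the page.
            return {"indexed": True, "rank": rank,
                    "snippet": result.get("snippet", "")}
    # Not found in results: the page may have moved, been de-indexed,
    # or gone offline - all signals worth alerting on.
    return {"indexed": False, "rank": None, "snippet": ""}
```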

&lt;h2&gt;
  
  
  Real Enterprise Scenario
&lt;/h2&gt;

&lt;p&gt;Let's make it real. Consider a healthcare AI company that provides clinical decision support and relies on several mission-critical RAG sources to power its support assistant agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FDA medical device guidance&lt;/li&gt;
&lt;li&gt;Clinical trial databases&lt;/li&gt;
&lt;li&gt;Medical journal guidelines&lt;/li&gt;
&lt;li&gt;Drug interaction databases&lt;/li&gt;
&lt;li&gt;Treatment protocol repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenario 1: The Cost of Unmonitored Sources
&lt;/h3&gt;

&lt;p&gt;Not monitoring these sources could result in silent failures that are ultimately detected by end-users. This erodes trust. The table below depicts such a scenario.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event&lt;/td&gt;
&lt;td&gt;In November 2024, the FDA updated its AI/ML medical device guidance with new risk classifications.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notification&lt;/td&gt;
&lt;td&gt;The update was posted on FDA.gov, but no direct notification was sent to external systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System Awareness&lt;/td&gt;
&lt;td&gt;Zero. The RAG system continued to use outdated information.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discovery&lt;/td&gt;
&lt;td&gt;A clinical user noticed an outdated risk category in an AI recommendation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Impact&lt;/td&gt;
&lt;td&gt;2 weeks of potentially incorrect guidance cited. The error triggered an emergency source audit and consumed 40 hours of Subject Matter Expert (SME) review time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause&lt;/td&gt;
&lt;td&gt;The company had no automated process to monitor the FDA site for content changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scenario 2: Proactive Detection with Source Monitoring
&lt;/h3&gt;

&lt;p&gt;Now let's look at how this scenario plays out when these data sources are monitored using SERP APIs.&lt;/p&gt;

&lt;p&gt;SERP API driven searches detect changes that affect the quality score, raising an alert that is triaged and resolved within 8 hours of the change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;FDA AI/ML Medical Device Guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Score&lt;/td&gt;
&lt;td&gt;Dropped from 92/100 to 45/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issue&lt;/td&gt;
&lt;td&gt;Significant content change detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to Discovery&lt;/td&gt;
&lt;td&gt;4 hours after the FDA published the update&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Result: The clinical team received the alert within 4 hours. They reviewed the new guidance, updated their RAG source configuration, and validated recommendations before any incorrect responses were served to users.&lt;/p&gt;
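&lt;p&gt;The quality score drop in this scenario (92 to 45) can be modelled as a simple weighted penalty function. The weights and caps below are illustrative assumptions, not the monitor's exact formula:&lt;/p&gt;

```python
def quality_score(available: bool, days_since_update: int, drift_ratio: float) -> int:
    """Combine availability, freshness, and content drift into a 0-100 score.

    drift_ratio: fraction of the page content that changed since the last
    check (0.0 = identical, 1.0 = completely different).
    """
    if not available:
        return 0  # unreachable sources score zero outright
    score = 100
    score -= min(30, days_since_update)       # staleness penalty, capped at 30
    score -= int(60 * min(1.0, drift_ratio))  # drift penalty dominates
    return max(0, score)
```

&lt;p&gt;With these assumed weights, a fresh page with roughly 70% of its content changed lands in the mid-50s, which is the kind of drop that should page the clinical team.&lt;/p&gt;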

&lt;h2&gt;
  
  
  Why SERP APIs vs. Direct URL Fetching
&lt;/h2&gt;

&lt;p&gt;You have three options for monitoring RAG source quality:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch and parse each URL yourself - you hit every page, parse HTML, and hope the structure doesn't break, burning infrastructure and still missing moved URLs.&lt;/li&gt;
&lt;li&gt;Rely on RSS feeds or changelogs - many sources don't offer them, and they rarely tell you what actually changed.&lt;/li&gt;
&lt;li&gt;Use SERP APIs - let search engines track changes, redirects, and indexing for you, via lightweight, structured search metadata.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Detection Speed&lt;/th&gt;
&lt;th&gt;Infrastructure&lt;/th&gt;
&lt;th&gt;Reliability&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct fetching&lt;/td&gt;
&lt;td&gt;Hours-Days&lt;/td&gt;
&lt;td&gt;High (parsing)&lt;/td&gt;
&lt;td&gt;Medium (brittle)&lt;/td&gt;
&lt;td&gt;Depends on robots.txt&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RSS/change logs&lt;/td&gt;
&lt;td&gt;Immediate (if available)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low (incomplete)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SERP APIs&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Comprehensive&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt; works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time search index - Changes reflected within hours of search engine crawl&lt;/li&gt;
&lt;li&gt;Structured JSON results - No HTML parsing, clean metadata extraction&lt;/li&gt;
&lt;li&gt;Global coverage - Monitor sources in any geography, any language&lt;/li&gt;
&lt;li&gt;Infrastructure handled - Proxies, rate limiting, CAPTCHA solving managed&lt;/li&gt;
&lt;li&gt;Batch queries - Validate 100+ sources in seconds&lt;/li&gt;
&lt;li&gt;Historical data - Track source quality trends over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The alternative is building fetching infrastructure that respects rate limits, parses diverse HTML structures, and handles authentication - all for a non-core capability.&lt;/p&gt;
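&lt;p&gt;The batch-query point is easy to exploit from Python: because each validation is a lightweight API call, a thread pool can fan checks out across the whole source list. A minimal sketch, where &lt;code&gt;validate_one&lt;/code&gt; is a stand-in for your SERP API client call:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def validate_batch(urls, validate_one, max_workers: int = 20) -> dict:
    """Validate many sources concurrently and return {url: result}.

    `validate_one(url)` wraps whatever client you use (a SERP API call in
    production); injecting it keeps this helper testable offline.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly.
        return dict(zip(urls, pool.map(validate_one, urls)))
```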

&lt;h2&gt;
  
  
  Production Deployment Patterns
&lt;/h2&gt;

&lt;p&gt;When teams put this into production, they usually standardize on a few repeatable deployment patterns rather than ad‑hoc scripts. In practice, the choice comes down to how quickly you need to detect issues and how much monitoring budget you have. Here's how those patterns compare:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;th&gt;Check frequency examples&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled source validation&lt;/td&gt;
&lt;td&gt;Run a recurring job that validates each source and updates health metrics and alerts.&lt;/td&gt;
&lt;td&gt;Critical: every 6 hours; Standard: daily; Low‑change: weekly&lt;/td&gt;
&lt;td&gt;Stable sources that rarely change, where daily detection is good enough.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continuous monitoring with adaptive intervals&lt;/td&gt;
&lt;td&gt;Long‑running service that adjusts check frequency based on how often each source changes.&lt;/td&gt;
&lt;td&gt;Recently changed: every 2 hours; Stable: every 48 hours&lt;/td&gt;
&lt;td&gt;Mixed source stability and cost sensitivity, where you want fast detection only for "hot" sources.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event‑driven source validation&lt;/td&gt;
&lt;td&gt;Hook validation into the RAG pipeline and trigger checks when quality signals degrade or for key flows.&lt;/td&gt;
&lt;td&gt;On quality drop, before critical queries, or after notable retrieval anomalies&lt;/td&gt;
&lt;td&gt;Mature RAG observability setups that want to tie source health directly to system performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
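&lt;p&gt;The adaptive-interval pattern in the middle row reduces to a small scheduling function. The thresholds below (2, 24, and 48 hours) mirror the examples in the table but are otherwise assumptions you should tune to your own sources:&lt;/p&gt;

```python
def next_check_interval(hours_since_last_change: float,
                        base_hours: float = 24.0) -> float:
    """Return the hours to wait before re-checking a source.

    Recently changed ("hot") sources are polled every 2 hours; sources
    stable for over a week back off to 48 hours to save API budget.
    """
    if hours_since_last_change < 24:
        return 2.0          # changed in the last day: watch closely
    if hours_since_last_change < 7 * 24:
        return base_hours   # changed this week: normal cadence
    return 48.0             # long stable: back off
```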

&lt;h2&gt;
  
  
  Integration with RAG Observability
&lt;/h2&gt;

&lt;p&gt;To make this monitor useful, you need to wire it into your existing RAG observability stack, not leave it as a standalone script. The monitor should emit structured metrics such as source quality scores over time, availability rates, content drift frequency, mean time to detect issues, and false positive rates. You can then correlate these with RAG performance signals (accuracy, user corrections, escalation volume) to see how source degradation impacts answers and automate root‑cause analysis. Finally, route alerts by severity into your incident channels, with impact and recommended actions included for fast triage.&lt;/p&gt;
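&lt;p&gt;As an example of severity-based routing, here is a sketch that turns a score drop into a structured alert with a recommended action, ready to forward to an incident channel. The thresholds, severity names, and action strings are illustrative:&lt;/p&gt;

```python
def route_alert(source: str, old_score: int, new_score: int) -> dict:
    """Classify a quality-score drop and attach a recommended action."""
    drop = old_score - new_score
    if new_score < 50 or drop >= 40:
        severity = "critical"
        action = "Pause retrieval from this source and review immediately"
    elif drop >= 15:
        severity = "warning"
        action = "Schedule SME review of the source"
    else:
        severity = "info"
        action = "No action needed"
    return {"source": source, "old_score": old_score,
            "new_score": new_score, "severity": severity, "action": action}
```

&lt;p&gt;The FDA scenario above (92 to 45) would route as critical, while a minor drift on a low-stakes source stays informational.&lt;/p&gt;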

&lt;p&gt;For readers interested in SERP‑powered RAG, see Bright Data's guide, &lt;a href="https://github.com/luminati-io/rag-chatbot" rel="noopener noreferrer"&gt;"How to Build a RAG Chatbot Using GPT Models and SERP API."&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When this approach makes sense
&lt;/h2&gt;

&lt;p&gt;This monitoring strategy is worth implementing when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your RAG system cites regulated content - Healthcare, finance, legal, or compliance domains where citing outdated sources creates liability.&lt;/li&gt;
&lt;li&gt;You depend on 10+ external web sources - If your RAG only uses internal documents, version control handles this. If you retrieve from dozens of external sites, manual monitoring doesn't scale.&lt;/li&gt;
&lt;li&gt;Response accuracy is critical - Customer-facing systems, decision support tools, or automated workflows where wrong answers have real consequences.&lt;/li&gt;
&lt;li&gt;Sources change frequently - Government sites, regulatory agencies, and technical documentation update regularly without notification.&lt;/li&gt;
&lt;li&gt;You operate at scale - Processing hundreds or thousands of queries daily means even a 1% error rate from degraded sources impacts many users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All sources are internal and version-controlled - Your internal wiki/Confluence is already tracked by your CMS.&lt;/li&gt;
&lt;li&gt;Low consequence of errors - Internal research tools where users verify information anyway.&lt;/li&gt;
&lt;li&gt;Very small source set - If you only retrieve from 2-3 highly stable sources, manual monitoring is sufficient.&lt;/li&gt;
&lt;li&gt;Sources rarely change - Historical documents, archived content, or static reference material don't need real-time monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're not ready, start with basic retrieval monitoring (can we fetch the URL?). Graduate to content validation (is the content what we expect?) before implementing drift detection.&lt;/p&gt;
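&lt;p&gt;That first step, basic retrieval monitoring, can be as small as a reachability-and-redirect check. A sketch using only the standard library; the &lt;code&gt;opener&lt;/code&gt; parameter is an assumption added so the logic can be tested without network access:&lt;/p&gt;

```python
from urllib import error, request

def basic_url_check(url: str, opener=None, timeout: float = 10.0) -> dict:
    """Level-one monitoring: is the URL reachable, and did it redirect?

    `opener` defaults to urllib but can be stubbed for tests; a redirect
    to an archive page is exactly the silent failure described earlier.
    """
    opener = opener or (lambda u: request.urlopen(u, timeout=timeout))
    try:
        with opener(url) as resp:
            final_url = resp.geturl()  # urllib follows redirects for us
            return {"ok": True, "status": resp.status,
                    "redirected": final_url.rstrip("/") != url.rstrip("/")}
    except error.URLError as exc:
        return {"ok": False, "error": str(exc)}
```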

&lt;h2&gt;
  
  
  Beyond Source Validation
&lt;/h2&gt;

&lt;p&gt;This guide focuses on monitoring source quality for existing RAG systems. The same SERP API approach can extend to many other use-cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source discovery - Find new authoritative sources on emerging topics by monitoring search rankings.&lt;/li&gt;
&lt;li&gt;Competitive analysis - Track what sources competitors' RAG systems cite by analyzing their public responses.&lt;/li&gt;
&lt;li&gt;Content gap detection - Identify topics where authoritative sources don't exist or are insufficient.&lt;/li&gt;
&lt;li&gt;Source diversification - Monitor alternative sources to reduce dependency on any single provider.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is consistent: Use SERP APIs to maintain visibility into the web ecosystem your RAG system depends on but doesn't control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The full implementation is available in the &lt;a href="https://github.com/sanbhaumik/rag-data-quality-monitor" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. To run it locally, you'll need Python 3.10+, Ollama with the llama3.1 and nomic-embed-text models pulled, and a &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data API&lt;/a&gt; key for the web monitoring checks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F899n7vf4an6lb4n596yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F899n7vf4an6lb4n596yg.png" alt="Architecture Overview: RAG Data Source Quality Monitoring" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clone the repo, create a virtual environment, and install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sanbhaumik/rag-data-quality-monitor
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-data-quality-monitor
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and fill in your credentials - at minimum, your &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt; and Gmail SMTP settings for email alerts. If you prefer OpenAI over Ollama, set &lt;code&gt;LLM_BACKEND=openai&lt;/code&gt; and add your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then launch the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./start_app.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a Streamlit dashboard at &lt;a href="http://localhost:8501" rel="noopener noreferrer"&gt;http://localhost:8501&lt;/a&gt; where you can ingest source data, ask questions via the RAG interface, trigger monitoring checks, and view the source health dashboard. The README covers all configuration options and the test suite in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;Sandipan Bhaumik has spent 18 years building production data and AI systems for enterprises across finance, healthcare, retail, and software. He helps organizations move from AI demos to production systems that deliver measurable business value.&lt;/p&gt;

&lt;p&gt;Connect: &lt;a href="https://www.linkedin.com/in/sandipanbhaumik" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://newsletter.agentbuild.ai" rel="noopener noreferrer"&gt;Newsletter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>rag</category>
    </item>
    <item>
      <title>Build a Custom Lead Enrichment Layer to Find Signal in Noise</title>
      <dc:creator>sandipan bhaumik</dc:creator>
      <pubDate>Thu, 29 Jan 2026 22:42:27 +0000</pubDate>
      <link>https://dev.to/sandipan_bhaumik_effe80b2/build-a-custom-lead-enrichment-layer-to-find-signal-in-noise-1301</link>
      <guid>https://dev.to/sandipan_bhaumik_effe80b2/build-a-custom-lead-enrichment-layer-to-find-signal-in-noise-1301</guid>
      <description>&lt;h3&gt;
  
  
  About This Deep-Dive
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; sponsored this technical walkthrough on custom lead enrichment — a challenge I’ve seen sales teams face when standard tools don’t track their specific buying signals.&lt;/p&gt;

&lt;p&gt;This is one approach that works for teams with clear signal-to-pipeline correlation data. It’s not the only approach, and it’s not right for everyone. I’ll show you when it makes sense and when it doesn’t.&lt;/p&gt;

&lt;p&gt;This article is purely based on my personal research and solution that I have built. Opinions are mine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generic Enrichment Misses Your Specific Signals
&lt;/h3&gt;

&lt;p&gt;Your contact database tells you company size, funding history, industry classification, and verified emails. Essential baseline data that every sales team needs. But contact databases are built for breadth; they can’t customize for the specific signals that matter to your deals.&lt;/p&gt;

&lt;p&gt;What it doesn’t tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A prospect just posted 15 engineering jobs mentioning the exact tech stack you integrate with&lt;/li&gt;
&lt;li&gt;A frustrated customer left a G2 review yesterday complaining about the problem you solve&lt;/li&gt;
&lt;li&gt;They announced a partnership this morning that makes them a perfect fit&lt;/li&gt;
&lt;li&gt;Their new CTO published a blog post about the strategic initiative you enable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These signals create urgency. And they’re invisible to standard enrichment platforms.&lt;/p&gt;

&lt;p&gt;Your buying signals are unique to your product and market. Standard enrichment can’t predict what creates urgency for your deals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtsntw0sckx53fwpvi9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtsntw0sckx53fwpvi9r.png" alt="Standard tools vs custom tools for lead enrichment" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why SERP APIs vs. Building Your Own
&lt;/h3&gt;

&lt;p&gt;You have four options for getting this data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build custom scrapers&lt;/strong&gt; for job boards, review sites, company blogs, and news sources. Full control, no per-query cost. But you’re looking at 3–6 months of development, ongoing maintenance as sites change their HTML, dealing with proxies and anti-bot measures, and potential legal risk. Only makes sense if you have dedicated engineering resources and long-term commitment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use RSS feeds and public APIs where available.&lt;/strong&gt; Free or low-cost. But coverage is spotty, updates are delayed, and data formats are inconsistent. Works for specific high-value sources like company blogs or press releases, not for comprehensive signal tracking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manual research.&lt;/strong&gt; Zero tooling cost. Doesn’t scale, quality is inconsistent, and your sales ops team has better things to do than Google every prospect. Fine if your entire TAM is under 100 accounts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;SERP APIs&lt;/a&gt; (this approach)&lt;/strong&gt;. Search engines already index everything within hours. One API, one authentication. Bright Data handles the proxies, CAPTCHAs, and rate limiting. You get clean JSON responses and can add new signal types just by changing search queries. Fast to build, comprehensive coverage, real-time updates. The tradeoff: per-query cost and vendor dependency.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This tutorial focuses on option 4 because it’s the fastest path to value for most teams with clear signal definitions and more than 50 priority accounts per week.&lt;/p&gt;

&lt;h3&gt;
  
  
  What One Team Discovered
&lt;/h3&gt;

&lt;p&gt;A sales ops leader at a mid-market data infrastructure company implemented custom signal tracking. Their ideal customers were companies migrating to modern data warehouses.&lt;/p&gt;

&lt;p&gt;They defined one specific signal to track: job postings mentioning “data engineer,” “Snowflake,” “dbt,” or “modern data stack.”&lt;/p&gt;

&lt;p&gt;What happened in Q2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caught 73 companies posting these exact jobs&lt;/li&gt;
&lt;li&gt;Sales reached out within 48 hours with relevant case studies&lt;/li&gt;
&lt;li&gt;Result: Sales team stopped wasting time on cold accounts and focused on companies showing active buying signals&lt;/li&gt;
&lt;li&gt;Qualified meeting rate jumped from 12% to 34%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pipeline value from these signal-driven leads: $1.2M&lt;/p&gt;

&lt;p&gt;Their standard enrichment tool would have shown these companies were “in the data/analytics industry” with “50–200 employees.”&lt;/p&gt;

&lt;p&gt;True but useless. The timing signal (active hiring for their exact use case) was invisible.&lt;/p&gt;

&lt;p&gt;That’s the capability you’re about to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  When This Approach Makes Sense
&lt;/h3&gt;

&lt;p&gt;Alright, before we dive into implementation, let’s be clear about when custom signal tracking is worth the investment.&lt;/p&gt;

&lt;p&gt;Custom signal tracking is worth it when you’re past guesswork: sizable ARR, clear ICP, 60+ day sales cycles, and proof that certain signals predict pipeline. It’s also useful when standard enrichment isn’t enough, you have technical bandwidth to maintain it, and your team will actually personalize outreach based on fresh signals.&lt;/p&gt;

&lt;p&gt;It’s not worth it if you’re still defining your ICP, your sales cycle is under 30 days, you don’t have signal-to-conversion data, your team won’t personalize anyway, or you lack resources to maintain it.&lt;/p&gt;

&lt;p&gt;If you’re not ready, fix the basics first.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You’ll Build
&lt;/h3&gt;

&lt;p&gt;A Python enrichment engine that complements your existing contact data with custom, real-time signals:&lt;/p&gt;

&lt;p&gt;Input: Company domain and your custom signal definitions&lt;/p&gt;

&lt;p&gt;Output: Intelligence your standard tools don’t provide&lt;/p&gt;

&lt;p&gt;The engine tracks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Custom hiring signals — Job postings mentioning your use case, tech stack, or pain points&lt;/li&gt;
&lt;li&gt;Product/competitive intelligence — Announcements about tools you integrate with or compete against&lt;/li&gt;
&lt;li&gt;Customer sentiment signals — Recent reviews revealing pain points you solve&lt;/li&gt;
&lt;li&gt;Strategic direction signals — Blog posts, interviews, initiatives aligned with your value prop&lt;/li&gt;
&lt;li&gt;Partnership/integration signals — Announcements that create new opportunities&lt;/li&gt;
&lt;li&gt;Industry-specific triggers — Regulatory changes, compliance deadlines, technology migrations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, it provides scoring based on what actually predicts deals in YOUR pipeline, and personalized conversation starters from fresh signals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview:
&lt;/h3&gt;

&lt;p&gt;This setup combines your existing CRM enrichment with a real-time signal layer, so reps get both context and timing. It helps sales outreach stay relevant by triggering action from fresh events like hiring spikes, news, and partnerships.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcswkybgugc46bjrdf13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcswkybgugc46bjrdf13.png" alt="Lead Enrichment Agentic Workflow" width="314" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Define Your Custom Signals
&lt;/h4&gt;

&lt;p&gt;Before writing any code, sit down and figure out which signals actually predict deals in your pipeline. This is the hardest part and the most important.&lt;/p&gt;

&lt;p&gt;Here’s what the config looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUSTOM_SIGNALS = {
    "hiring_signals": {
        "keywords": ["data engineer", "Snowflake", "dbt"],
        "weight": 30,
        "query_template": "{company_name} hiring {keyword}"
    },
    "pain_point_signals": {
        "keywords": ["manual data processes", "data quality issues"],
        "weight": 35,
        "query_template": "{company_name} {keyword}"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real work is picking the right keywords for YOUR product. A DevTools company cares about “Next.js migration” and “TypeScript adoption.” An HR tech vendor tracks “rapid hiring” and “onboarding challenges.” A security startup monitors “breach disclosure” and “compliance audit.”&lt;/p&gt;

&lt;p&gt;The GitHub repo has pre-built configs for 8 industries. Pick yours, test it on 20 accounts, tune the keywords, then scale.&lt;/p&gt;
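&lt;p&gt;As an illustration, a DevTools vendor’s config might look like this. The keywords and weights below are hypothetical, not the repo’s shipped values:&lt;/p&gt;

```python
# Hypothetical DevTools config; keywords and weights are illustrative,
# not the values shipped in the repo.
DEVTOOLS_SIGNALS = {
    "hiring_signals": {
        "keywords": ["Next.js migration", "TypeScript adoption"],
        "weight": 30,
        "query_template": "{company_name} hiring {keyword}",
    },
    "pain_point_signals": {
        "keywords": ["slow CI builds", "flaky tests"],
        "weight": 35,
        "query_template": "{company_name} {keyword}",
    },
}

# The maximum possible score is the sum of the weights across signal types.
max_score = sum(cfg["weight"] for cfg in DEVTOOLS_SIGNALS.values())
print(max_score)  # 65
```

&lt;p&gt;Keeping the score bounded by the sum of weights makes it easy to set a consistent priority threshold across industries.&lt;/p&gt;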

&lt;h4&gt;
  
  
  Step 2: Build the Core Engine
&lt;/h4&gt;

&lt;p&gt;Two pieces make this work: a SERP client that talks to Bright Data’s API, and a signal tracker that scores what it finds.&lt;/p&gt;

&lt;p&gt;The SERP client searches and filters results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search_for_signals(self, query, result_count=3):
    response = self.query(query)
    # Filter for substantive descriptions (60–600 chars)
    # and return clean JSON with title, URL, description
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The signal tracker loops through your signal definitions, builds queries, detects matches, and calculates a score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def track_signals(self, company_name):
    total_score = 0
    for signal_type, config in CUSTOM_SIGNALS.items():
        for keyword in config['keywords']:
            query = config['query_template'].format(
                company_name=company_name,
                keyword=keyword
            )
            results = self.serp.search_for_signals(query)
            if self._signal_detected(results, [keyword]):
                total_score += config['weight']
    return total_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the signal definitions in your config file, and the tracker adapts automatically. No code changes needed.&lt;/p&gt;

&lt;p&gt;Why &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;SERP APIs&lt;/a&gt; work for this: Search engines already index job boards, review sites, company blogs, and news within hours. You get real-time signal detection without building and maintaining scrapers for dozens of different sites. Bright Data handles the proxies, CAPTCHAs, and rate limiting. You just get clean JSON responses.&lt;/p&gt;
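&lt;p&gt;For reference, a minimal sketch of what the client sends under the hood. The payload fields follow Bright Data’s /request endpoint; the zone name is account-specific, and this sketch only builds the payload rather than sending it:&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_serp_request(zone, query, gl="us", hl="en"):
    # Build the payload for Bright Data's /request endpoint.
    # "zone" is your own SERP API zone name; the actual call is a POST
    # to https://api.brightdata.com/request with a Bearer token.
    params = urlencode({"q": query, "gl": gl.upper(), "hl": hl})
    return {
        "zone": zone,
        "url": "https://www.google.com/search?" + params,
        "format": "json",
    }

payload = build_serp_request("my_serp_zone", "example.com hiring data engineer")
print(payload["url"])
```

&lt;p&gt;The gl and hl parameters let you localize results per market without touching the rest of the pipeline.&lt;/p&gt;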

&lt;p&gt;The complete implementation with error handling, conversation starters, and CRM export is in the GitHub repo.&lt;/p&gt;

&lt;h4&gt;
  
  
  What You Get: Real Output
&lt;/h4&gt;

&lt;p&gt;Run this on a company and you get a scored analysis with conversation starters. Here’s what it looks like for Anthropic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**ANTHROPIC - Custom Signal Analysis**
Signal Score: 85/100 (High Intent)
Signals Detected:
- Hiring for ML Engineers with Snowflake experience
- G2 reviews mentioning data infrastructure challenges
- Recent blog post on AI data strategy

**Conversation Starters:**
1. "Saw you're hiring ML engineers with Snowflake experience…"
2. "Read feedback about data infrastructure scaling challenges…"
3. "Just saw your post on AI data systems - curious how this fits your roadmap?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also exports a CSV ready to import into Salesforce or HubSpot as custom fields. Your sales team sees these signals right alongside the standard firmographic data.&lt;/p&gt;
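&lt;p&gt;A minimal version of that CSV export looks like this. The column names are illustrative; match them to the custom fields you create in Salesforce or HubSpot:&lt;/p&gt;

```python
import csv
import io

# Hypothetical enriched records; field names are illustrative and should
# match the custom fields you define in your CRM.
records = [
    {"domain": "example.com", "signal_score": 85,
     "signals": "hiring_signals; pain_point_signals"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["domain", "signal_score", "signals"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

&lt;p&gt;From here, both Salesforce Data Loader and HubSpot’s import wizard can map the columns to custom fields directly.&lt;/p&gt;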

&lt;h3&gt;
  
  
  The Economics: Layering Intelligence
&lt;/h3&gt;

&lt;p&gt;You’re not replacing your contact database. You’re adding a layer on top. Your existing tool gives you the basics: contact info, company size, tech stack, funding history. Keep paying for that. It’s essential for working at scale.&lt;/p&gt;

&lt;p&gt;This custom signal layer costs about $0.30–0.50 per lead enriched (5–6 search queries). But you don’t run it on every lead. Just your top 50–100 priority accounts each week.&lt;/p&gt;

&lt;p&gt;So the math looks like this: Marketing generates 500 leads monthly, your standard enrichment handles all of them. Sales picks 200 priority accounts. You run custom signals on those 200. Cost: $60–100 for signals that month, on top of whatever you’re already paying for baseline enrichment.&lt;/p&gt;

&lt;p&gt;What you get for that $60–100: conversation starters based on what happened this week, not generic firmographics that everyone else has too.&lt;/p&gt;
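&lt;p&gt;That math as a quick sanity check, using the figures from the paragraphs above:&lt;/p&gt;

```python
# Sanity-check the monthly signal cost quoted above: 200 priority
# accounts at $0.30-0.50 per enriched lead (5-6 SERP queries each).
priority_accounts = 200
cost_low, cost_high = 0.30, 0.50

monthly_low = priority_accounts * cost_low    # 60.0
monthly_high = priority_accounts * cost_high  # 100.0
print(monthly_low, monthly_high)  # 60.0 100.0
```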

&lt;p&gt;Compare these two outreach messages:&lt;/p&gt;

&lt;p&gt;Without custom signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Hi [Name], I noticed you work in data infrastructure. 
We help companies modernize their data stack…"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With custom signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Hi [Name], 
I saw you just posted for 3 data engineers with Snowflake experience.
We helped [similar company] scale their Snowflake deployment from 50TB to 500TB.
Relevant case study attached…"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can decide which one works. The second gets opened; the first gets ignored.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Teams Actually Use This
&lt;/h3&gt;

&lt;p&gt;Most teams run this one of four ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Priority account enrichment:&lt;/strong&gt; Pull your top 50–100 accounts from CRM each week, enrich them, push the results back as custom fields. Sales sees fresh signals right next to standard data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trigger-based:&lt;/strong&gt; When a new lead enters the pipeline and matches your ICP criteria, run the enrichment automatically. High-score accounts get routed to your best reps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weekly monitoring:&lt;/strong&gt; Track your existing pipeline for signal changes. When a low-intent account from last month suddenly posts relevant jobs, move them to priority.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Account research automation:&lt;/strong&gt; AE requests research on a target account, system pulls latest signals and generates a brief with conversation starters. They walk into the call with the current context.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
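&lt;p&gt;The weekly-monitoring pattern above can be a simple diff over stored scores. This sketch uses in-memory dicts and a hypothetical priority threshold of 60; in practice the scores would live in your CRM or a database:&lt;/p&gt;

```python
# Flag accounts whose signal score crossed the priority threshold
# since the last run. Threshold and scores are hypothetical.
last_week = {"example.com": 20, "acme.io": 70}
this_week = {"example.com": 85, "acme.io": 70}
PRIORITY_THRESHOLD = 60

newly_hot = [
    domain for domain, score in this_week.items()
    if score in range(PRIORITY_THRESHOLD, 101)
    and last_week.get(domain, 0) in range(0, PRIORITY_THRESHOLD)
]
print(newly_hot)  # ['example.com']
```

&lt;p&gt;Accounts already above the threshold last week stay where they are; only genuine movers get re-routed to priority.&lt;/p&gt;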

&lt;p&gt;The &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis-lead-enrichment.git" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has example configs for DevTools, HR Tech, Security, Data/Analytics, Fintech, SaaS, MarTech, and Infrastructure companies. Pick the one closest to your market, customize the keywords to match what actually predicts YOUR deals, test on 20 accounts, then scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis-lead-enrichment.git" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has pre-built signal configs for 8 industries: DevTools, HR Tech, Security, Data/Analytics, Fintech, SaaS, MarTech, and Infrastructure. Each includes 5–6 signals proven to predict deals, example companies to test with, and customization guidance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick start:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sign up for &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt; (free trial included)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis-lead-enrichment.git" rel="noopener noreferrer"&gt;Clone the repo&lt;/a&gt; and add your API credentials&lt;/li&gt;
&lt;li&gt;Pick your industry config or customize your own&lt;/li&gt;
&lt;li&gt;Run: python cli.py enrich --domain example.com&lt;/li&gt;
&lt;li&gt;Start with 20 test accounts to validate signal quality, tune your keywords, then scale to your weekly priority account list.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Mistakes to Avoid
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Over-engineering from day one&lt;/strong&gt;. Don’t start with 15 signal types. Pick 3–4 you know correlate with deals. Test them. Then add more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping validation&lt;/strong&gt;. Run this on 20 known-good accounts first. Check if the signals are actually present and relevant. Tune your keywords before scaling. Garbage in, garbage out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No feedback loop&lt;/strong&gt;. Track which signals led to meetings and deals. Double down on what works. Kill what doesn’t. Signals that don’t predict pipeline are just noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enrich everything&lt;/strong&gt;. Don’t run this on every single lead. Focus on priority accounts. Set query budgets. Cache results for 7–14 days to balance freshness against cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building it but not using it&lt;/strong&gt;. Train your sales team on how to reference signals in outreach naturally. Provide templates. Without execution, perfect data is worthless.&lt;/li&gt;
&lt;/ul&gt;
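&lt;p&gt;The caching advice above can be a few lines. This sketch assumes a 7-day TTL and an in-memory dict; production code would use a persistent store:&lt;/p&gt;

```python
import time

# Minimal TTL cache for signal lookups, per the 7-14 day advice above.
# "fetch" is any callable that runs the expensive SERP queries.
CACHE_TTL_SECONDS = 7 * 24 * 3600
_cache = {}

def cached_signals(domain, fetch, now=time.time):
    entry = _cache.get(domain)
    if entry and now() - entry["at"] < CACHE_TTL_SECONDS:
        return entry["value"]  # still fresh: skip the queries
    value = fetch(domain)
    _cache[domain] = {"at": now(), "value": value}
    return value

calls = []
def fake_fetch(domain):
    calls.append(domain)
    return {"score": 85}

cached_signals("example.com", fake_fetch)
cached_signals("example.com", fake_fetch)  # served from cache
print(len(calls))  # 1
```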

&lt;h3&gt;
  
  
  When Things Go Wrong
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Getting irrelevant results? Your keywords are probably too broad. “Engineer” matches everything. Try “Senior Data Engineer” instead. Add site filters like site:linkedin.com/jobs to your query template, and test queries manually in Google first to see what you’ll get.&lt;/li&gt;
&lt;li&gt;Too slow? You’ve enabled too many signals or set result_count too high. Disable low-value signals, reduce result count to 3–5, and implement caching for signals that don’t need real-time updates.&lt;/li&gt;
&lt;li&gt;Costs higher than expected? You’re enriching every lead instead of just priority accounts. Implement query deduplication, cache results, and kill any signals with less than 10% conversion correlation. Only enrich accounts that score above your ICP threshold.&lt;/li&gt;
&lt;li&gt;CRM integration failing? Start with CSV export and manual import to test your field mapping. Check the CRM API docs for field type limits. Implement batch uploads with delays to avoid rate limiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Resources
&lt;/h3&gt;

&lt;p&gt;I’ve put together templates: complete code explanation, industry-specific configs for all 8 verticals, CRM integration templates for Salesforce and HubSpot, a cost calculator spreadsheet, and a troubleshooting decision tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis-lead-enrichment.git" rel="noopener noreferrer"&gt;Download the Implementation Kit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions? Reach out on LinkedIn.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;Sandipan Bhaumik has spent 18 years building production data and AI systems for enterprises across finance, healthcare, retail, and software. He helps organizations move from AI demos to production systems that deliver measurable business value.&lt;/p&gt;

&lt;p&gt;Connect: &lt;a href="https://www.linkedin.com/in/sandipanbhaumik" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://newsletter.agentbuild.com" rel="noopener noreferrer"&gt;Newsletter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>genai</category>
      <category>ai</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>sandipan bhaumik</dc:creator>
      <pubDate>Mon, 29 Dec 2025 16:23:46 +0000</pubDate>
      <link>https://dev.to/sandipan_bhaumik_effe80b2/-436m</link>
      <guid>https://dev.to/sandipan_bhaumik_effe80b2/-436m</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/sandipan_bhaumik_effe80b2" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3684391%2F26bb4c39-9465-49a7-b607-0a93da2d4311.jpg" alt="sandipan_bhaumik_effe80b2"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/sandipan_bhaumik_effe80b2/build-a-competitive-intelligence-agent-in-under-400-lines-of-python-27m" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Build a Competitive Intelligence Agent in Under 400 Lines of Python&lt;/h2&gt;
      &lt;h3&gt;sandipan bhaumik ・ Dec 29&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#agents&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#api&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Build a Competitive Intelligence Agent in Under 400 Lines of Python</title>
      <dc:creator>sandipan bhaumik</dc:creator>
      <pubDate>Mon, 29 Dec 2025 14:18:22 +0000</pubDate>
      <link>https://dev.to/sandipan_bhaumik_effe80b2/build-a-competitive-intelligence-agent-in-under-400-lines-of-python-27m</link>
      <guid>https://dev.to/sandipan_bhaumik_effe80b2/build-a-competitive-intelligence-agent-in-under-400-lines-of-python-27m</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: Manual Competitive Research Doesn’t Scale
&lt;/h2&gt;

&lt;p&gt;Picture this: Your product team wants to understand what OpenAI just launched. Your sales team needs to know how Anthropic positions Claude against competitors. Your executives want weekly updates on the AI market landscape.&lt;/p&gt;

&lt;p&gt;Right now, someone on your team is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening 20+ browser tabs to Google different queries&lt;/li&gt;
&lt;li&gt;Copy-pasting snippets into a Google Doc&lt;/li&gt;
&lt;li&gt;Trying to remember which article said what&lt;/li&gt;
&lt;li&gt;Formatting everything into a deck for the Monday meeting&lt;/li&gt;
&lt;li&gt;Starting over next week when the questions change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each competitive analysis takes 30–45 minutes. Multiply that by every competitor, every week, and you’ve got a full-time job that’s still slow, inconsistent, and impossible to scale.&lt;/p&gt;

&lt;p&gt;There’s a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Automate Google Searches with SERP APIs
&lt;/h2&gt;

&lt;p&gt;Here’s what changes when you use &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data’s SERP (Search Engine Results Page) API&lt;/a&gt; (aff): Instead of manually Googling and clicking through results, you programmatically query search engines and get back structured JSON data. No browser. No clicking. No copy-paste.&lt;/p&gt;

&lt;p&gt;In this tutorial, you’ll build a production-ready competitive intelligence agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes company domains as input (openai.com, anthropic.com)&lt;/li&gt;
&lt;li&gt;Runs targeted Google searches via Bright Data SERP API&lt;/li&gt;
&lt;li&gt;Extracts and organizes intelligence automatically&lt;/li&gt;
&lt;li&gt;Generates professional PDF reports with sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to build: 2 hours&lt;br&gt;
Time saved per report: 40+ minutes&lt;br&gt;
Code: ~350 lines of clean Python&lt;/p&gt;

&lt;p&gt;Let’s build it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why SERP APIs Matter
&lt;/h2&gt;

&lt;p&gt;Before we dive into code, let’s talk about why SERP APIs are the right tool for this job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The search engine problem:&lt;/strong&gt; Google actively blocks automated scrapers. You’d need to manage proxies, handle CAPTCHAs, deal with IP bans, and maintain brittle HTML parsers that break when Google changes their layout.&lt;/p&gt;
&lt;h2&gt;
  
  
  What SERP APIs solve:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reliability: Proxy rotation, CAPTCHA solving, and rate limiting handled automatically&lt;/li&gt;
&lt;li&gt;Global coverage: Get search results from any country (gl=US, gl=UK, gl=JP) instantly&lt;/li&gt;
&lt;li&gt;Structured data: Clean JSON responses with titles, URLs, and descriptions already parsed&lt;/li&gt;
&lt;li&gt;Legal compliance: Operates within terms of service — no gray area&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real impact: with traditional scraping, you’d spend weeks building infrastructure before writing any intelligence logic. With SERP APIs, you get to the value in hours.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started: Clone and Setup
&lt;/h2&gt;

&lt;p&gt;The complete codebase lives on &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. For installation and setup, follow the &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis/blob/main/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/sanbhaumik/bright-data-serp-apis.git
cd bright-data-serp-apis
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Important: You’ll need &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt; credentials. Sign up, create a SERP API zone, and add your credentials to a .env file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit .env with your API_KEY and ZONE.&lt;/p&gt;

&lt;p&gt;The README walks through setup in detail. This blog focuses on how the code actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Ask Better Questions, Get Better Intelligence
&lt;/h2&gt;

&lt;p&gt;The secret to useful competitive intel isn’t running more searches — it’s asking the right questions. Our agent runs 4 strategic searches per company:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query 1: Market Positioning

“anthropic.com vs competitors”

How do they differentiate?
What's the competitive landscape?

Query 2: Customer Intelligence

“anthropic.com customers case study”

Who uses their products?
What problems are they solving?

Query 3: Strategic Moves

“anthropic.com funding OR acquisition OR partnership”

What deals are they making?
Who's investing?

Query 4: Product Strategy

“anthropic.com product launch OR new feature”

What are they building?
Where's the roadmap heading?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; These queries map to the questions your executives actually ask.&lt;/p&gt;

&lt;p&gt;“How do we compare?”&lt;br&gt;
“Who are their customers?”&lt;br&gt;
“What are they building?”&lt;/p&gt;

&lt;p&gt;You get answers, not data dumps.&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Code Works: 3 Core Components
&lt;/h2&gt;

&lt;p&gt;This section details SerpClient, a Python wrapper for the Bright Data SERP API. It explains how the API solves common scraping problems (proxies, CAPTCHAs, parsing) by providing reliable, structured JSON data for all search queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09j2dl61760rttkmy0k.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09j2dl61760rttkmy0k.webp" alt="Sequence diagram: How the agent processes a competitive research request" width="720" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1. SerpClient: Your Data Gateway
&lt;/h3&gt;

&lt;p&gt;serp_client.py wraps the Bright Data SERP API in ~50 lines. Here’s why SERP APIs matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scraping problem:&lt;/strong&gt; Google actively blocks automated scrapers. &lt;/p&gt;

&lt;p&gt;You’d need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage proxy rotation (which is an expensive operation)&lt;/li&gt;
&lt;li&gt;Solve CAPTCHAs (honestly, at scale, it is complex)&lt;/li&gt;
&lt;li&gt;Maintain brittle HTML parsers that break constantly&lt;/li&gt;
&lt;li&gt;Handle rate limits and IP bans (frustrating, always)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SERP API solution: all that infrastructure is handled for you.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable access through automatic proxy rotation&lt;/li&gt;
&lt;li&gt;Clean JSON responses (no HTML parsing)&lt;/li&gt;
&lt;li&gt;Global geo-targeting (gl=US, gl=UK, gl=JP)&lt;/li&gt;
&lt;li&gt;Legal compliance within terms of service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SerpClient class handles authentication and request formatting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def query(self, keyword, gl='us', hl='en'):
    payload = {
        "zone": self.zone,
        "url": f"https://www.google.com/search?q={quote_plus(keyword)}&amp;amp;gl={gl.upper()}&amp;amp;hl={hl}",
        "format": "json"
    }
    response = requests.post(
        "https://api.brightdata.com/request",
        json=payload,
        headers={"Authorization": f"Bearer {self.api_key}"}
    )
    return response.json()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The get_multiple_results() method adds smart filtering — keeping only results with substantial descriptions (50–500 characters) to filter out low-quality content.&lt;/p&gt;
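&lt;p&gt;That filter is essentially a one-liner over the parsed results. A sketch, assuming each result is a dict with a description key:&lt;/p&gt;

```python
def filter_substantive(results, lo=50, hi=500):
    # Keep only results whose description length falls within [lo, hi],
    # mirroring the filtering described above.
    return [r for r in results if lo <= len(r.get("description", "")) <= hi]

results = [
    {"title": "Thin", "description": "too short"},  # dropped
    {"title": "Useful", "description": "d" * 120},  # kept
]
kept = filter_substantive(results)
print([r["title"] for r in kept])  # ['Useful']
```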

&lt;h3&gt;
  
  
  2. CompetitiveIntelAgent: Where Intelligence Happens
&lt;/h3&gt;

&lt;p&gt;enrichment_agent.py (~100 lines) orchestrates the 4 strategic queries and structures the results:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def research_company(self, domain, company_name=None):
    # Run 4 targeted queries
    positioning = self.serp.get_multiple_results(
        f'{domain} vs competitors', count=3
    )

    customers = self.serp.get_multiple_results(
        f'{domain} customers case study', count=3
    )

    strategic_moves = self.serp.get_multiple_results(
        f'{domain} funding OR acquisition OR partnership', count=3
    )

    product_news = self.serp.get_multiple_results(
        f'{domain} product launch OR new feature', count=3
    )

    return {
        "company_name": company_name,
        "domain": domain,
        "positioning": positioning,
        "customers": customers,
        "strategic_moves": strategic_moves,
        "product_strategy": product_news
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What makes this powerful:&lt;/strong&gt; Each insight includes the source URL. When your executive asks “where did this come from?”, you have verification built in.&lt;/p&gt;

&lt;p&gt;That’s the difference between “interesting research” and “intelligence we can act on.”&lt;/p&gt;

&lt;h3&gt;
  
  
  3. PDF Generator + Main Orchestration
&lt;/h3&gt;

&lt;p&gt;pdf_generator.py (~200 lines) transforms raw intelligence into professional reports with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title page with metadata&lt;/li&gt;
&lt;li&gt;Organized sections per competitor&lt;/li&gt;
&lt;li&gt;Clickable source URLs for verification&lt;/li&gt;
&lt;li&gt;Clean formatting for executive consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;main.py ties everything together and provides flexible output formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reports = research_competitors(
    ["openai.com", "anthropic.com"],
    output_format='text',  # or 'json' or 'pdf'
    generate_pdf=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ROI: Why This Matters
&lt;/h2&gt;

&lt;p&gt;Manual competitive research takes 45 minutes per competitor. This agent does it in 15 seconds for about twenty cents in API calls.&lt;/p&gt;

&lt;p&gt;But if you ask me, speed isn’t the real win — it’s consistency. Manual research quality depends on who’s doing it and when. This agent asks the same strategic questions every time and delivers the same professional format. Your sales team gets reliable intel whether it’s Tuesday morning or Friday afternoon.&lt;/p&gt;

&lt;p&gt;Scale to ten competitors: manual research takes a full workday. The agent finishes in a few minutes. Honestly, it’s the difference between having competitive intelligence and not having it at all.&lt;/p&gt;
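&lt;p&gt;The scaling claim checks out with quick arithmetic, using the figures above:&lt;/p&gt;

```python
# Quick check on the scaling claim: 45 minutes of manual research vs
# roughly 15 seconds of agent time, across ten competitors.
manual_minutes_per_competitor = 45
agent_seconds_per_competitor = 15
competitors = 10

manual_hours = competitors * manual_minutes_per_competitor / 60
agent_minutes = competitors * agent_seconds_per_competitor / 60
print(manual_hours, agent_minutes)  # 7.5 2.5
```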

&lt;h2&gt;
  
  
  Production Extensions
&lt;/h2&gt;

&lt;p&gt;Now, of course this architecture grows with your needs. Teams I’ve worked with run weekly automated scans of their top twenty competitors, pushing updates directly into Slack every Monday. Others integrate it into their sales workflow — before every major pitch, the agent researches the prospect’s competitors and drops a briefing into Salesforce.&lt;/p&gt;

&lt;p&gt;More sophisticated implementations track trends over time, comparing this week’s intelligence against last month’s to spot momentum shifts. Is a competitor launching features faster? You catch it automatically instead of three months too late.&lt;/p&gt;

&lt;p&gt;Because you control the queries, you tune for your industry — healthcare teams track FDA approvals, fintech monitors regulatory news, SaaS watches integration announcements. Same codebase, different strategic questions.&lt;/p&gt;

&lt;p&gt;There are so many opportunities here. Tell me in comments what you are thinking of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;You can build production-ready competitive intelligence tools in an afternoon that would take weeks with traditional approaches. I have done this; you can too.&lt;/p&gt;

&lt;p&gt;This tutorial demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple integration — 50 lines for the SERP client&lt;/li&gt;
&lt;li&gt;Real business value — Automated competitive research in 15 seconds&lt;/li&gt;
&lt;li&gt;Clean architecture — Easy to extend and customize&lt;/li&gt;
&lt;li&gt;Professional output — PDF reports executives trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is modular, the approach is scalable, and you can deploy this today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next steps:&lt;/strong&gt; Get your &lt;a href="https://get.brightdata.com/2039fnr15xfy" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt; credentials, clone the &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis" rel="noopener noreferrer"&gt;repo&lt;/a&gt;, follow the &lt;a href="https://github.com/sanbhaumik/bright-data-serp-apis/blob/main/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;, and run your first analysis. Fifteen minutes to set up.&lt;/p&gt;

&lt;p&gt;When that PDF generates with fresh intelligence and source attribution, you’ll immediately think of three other use cases. Prove the value, then extend where it matters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect with me on LinkedIn where I share insights on AI engineering, data architecture, and building production AI systems. I write about what actually works in the field — no fluff, just practical implementation strategies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.linkedin.com/in/sandipanbhaumik" rel="noopener noreferrer"&gt;Connect on LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>api</category>
    </item>
  </channel>
</rss>
