Lakshay Nasa for Extract Data

Supercharge Your AI Agents with a Custom RAG Pipeline Powered by Live Web Data

Just think for a moment: what if you could feed any web page to your AI agent and have it return exactly the info, answer, or summary you're looking for?

Actually, you can do that with ease using Scrapy + Zyte API.

Meet Fab 👨‍💻
Fab’s a dev with years of experience. Lately, he’s been diving into finance, learning about promising stocks. But here’s the problem: keeping up with daily news and press releases, scrolling through 10 articles and updates every morning, is hectic and manual.

So Fab decided to build an AI Agent that does it for him - fetching, reading, and summarizing everything in real time.

That’s basically a custom RAG pipeline, powered by live web data & no longer limited to static PDFs or outdated docs.

Why bother?
Because even the smartest AI agent is only as good as the data it can access:

  • LLMs have knowledge cutoffs
  • Real-time, domain-specific data (like finance) is crucial for decision making

By tapping into live web data, Fab’s agent can keep up with the world as it happens - always relevant, always ready.

But hold up ✋, summarizing/answering isn't the same as taking real actions. That’s where AI Agents and Agentic AI differ.

  • AI Agents are software systems designed to automate specific, well defined tasks, like chatbots, email sorting tools, or voice assistants, usually based on predefined tools or prompts.
  • Agentic AI, on the other hand, has a broader scope of autonomy.

What we’ll walk through here is technically an AI Agent, but since both share the same foundation, it could evolve into Agentic AI.

Fab's Toolkit 🛠️

Scrapy → for structured data extraction
Zyte API → to handle dynamic & complex websites
DuckDuckGo + yfinance → for extra search and finance insights
Agno → to orchestrate a multi-agent workflow
GroqCloud → lightning fast LLM inference

The Architecture 🏗️

Why Scrapy + Zyte API?

You could try doing this with just Scrapy and rotating proxies. But anyone who has scraped at scale knows the pain: blocks, captchas, failed requests.

That’s where Zyte API shines. It offloads the heavy lifting, so you don’t have to babysit your scrapers, you just get clean, structured data.

Think of it like having a dedicated backend team making sure your spiders never get stuck.
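
In Scrapy terms, that mostly means pointing the project settings at the scrapy-zyte-api plugin. Here's a minimal sketch of the manual configuration, with a placeholder API key; check the plugin docs for the currently recommended setup:

# settings.py - sketch of the manual scrapy-zyte-api wiring
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

ZYTE_API_KEY = "YOUR_ZYTE_API_KEY"          # placeholder
ZYTE_API_TRANSPARENT_MODE = True            # route regular Scrapy requests through Zyte API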

Data Collection the Right Way! 📥

Instead of scraping everything, Fab’s agent first collects URLs only... then fetches only the important data based on a trend score.

To handle this efficiently, Fab designed a Scrapy project with one base spider and four specialized spiders for fetching:

  1. News
  2. Press releases
  3. Transcripts
  4. Comments

The base spider takes care of the common groundwork that every site-specific spider reuses:

  • Fetching URLs and metadata
  • Cleaning and normalizing dates
  • Generating unique IDs from URLs
import scrapy
from urllib.parse import urlparse, urlunparse
from datetime import datetime, timedelta
import hashlib

class BaseFinanceSpider(scrapy.Spider):
    name = "base_finance_spider"
    allowed_domains = ["finance-example.com"]

    def clean_url(self, url):
        """Normalize URLs"""
        parsed = urlparse(url)
        return urlunparse(parsed._replace(query="", fragment=""))

    def create_id(self, url):
        """Generate unique ID from cleaned URL"""
        return hashlib.sha256(self.clean_url(url).encode()).hexdigest()

    def convert_date(self, raw_date, now=None):
        """Convert relative dates like 'Today' or 'Yesterday' to ISO"""
        now = now or datetime.now()
        if "Yesterday" in raw_date:
            return (now - timedelta(days=1)).isoformat()
        if "Today" in raw_date:
            return now.isoformat()
        # For demo, we skip complex parsing
        return raw_date

Each specialized spider inherits from the base spider and focuses on site-specific logic: navigating pages and extracting the key information for its data type.

At this stage, three of the specialized spiders collect only URLs and metadata, creating a JSON list for each data type. Comments are the exception: we scrape those right away. Think of it as preparing a “to do list” of pages for Fab’s agent to process later, keeping things organized and efficient.
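
For illustration, here's a hedged sketch of what one of those URL-collecting spiders might look like; the start URL, CSS selectors, and field names are placeholders, not the real site's markup:

class NewsSpider(BaseFinanceSpider):
    """Collects news URLs + metadata only; full content is fetched later."""
    name = "news_spider"
    start_urls = ["https://finance-example.com/news"]

    def parse(self, response):
        for card in response.css("article.news-card"):      # placeholder selector
            yield {
                "type": "news",
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "title": card.css("h2::text").get(default="").strip(),
                "date": self.convert_date(card.css(".date::text").get(default="")),
            }
        # Follow pagination if the site exposes a "next" link
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)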

When items are yielded, Scrapy pipelines automatically handle the cross-cutting tasks: URL normalization and ID assignment, deduplication, anonymization, comment linking, and saving items to JSON.

from scrapy.exceptions import DropItem  # used by DeduplicationPipeline

class UrlNormalizationPipeline:
    def process_item(self, item, spider):
        item['url'] = spider.clean_url(item.get('url'))
        item['id'] = spider.create_id(item['url'])
        return item

class DeduplicationPipeline:
    def __init__(self):
        self.seen_ids = set()
    def process_item(self, item, spider):
        if item['id'] in self.seen_ids:
            raise DropItem(f"Duplicate: {item['id']}")
        self.seen_ids.add(item['id'])
        return item

class AnonymizationPipeline:
    def process_item(self, item, spider):
        # Mask authors, publishers, or usernames
        return item

class JsonFileExportPipeline:
    def process_item(self, item, spider):
        # Save item to JSON file (with intermediate saves)
        return item
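
Registering the pipelines is a one-liner per class in settings.py. A sketch, assuming a project module named finance_scraper and illustrative priorities:

# settings.py - pipelines run in ascending priority order
ITEM_PIPELINES = {
    "finance_scraper.pipelines.UrlNormalizationPipeline": 100,
    "finance_scraper.pipelines.DeduplicationPipeline": 200,
    "finance_scraper.pipelines.AnonymizationPipeline": 300,
    "finance_scraper.pipelines.JsonFileExportPipeline": 400,
}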

What Each Spider Produces →

News JSON Sample

Press Releases JSON Sample

Transcript JSON Sample

Comments JSON Sample

Once the URLs and metadata are collected, Fab’s agent performs trend analysis, using comments as a central indicator to prioritize which pages to fetch in full.

Trend Analysis 📈

Now that we’ve gathered articles (news, press releases, transcripts) and comments, the next step is figuring out which topics are actually trending. Collecting raw content is only half the job; what makes it valuable is knowing where attention is going.

For this, we built a Trend Calculator. Its job is to take all the articles and comments we collected, connect them together, and then assign each article a trend score. The score is based on a few simple but powerful signals:

  • Comment activity – Articles with more comments get higher scores (up to a cap, so one viral post doesn’t skew everything).
  • Mentions inside comments – If people are discussing one article inside the comments of another, that’s a sign of influence.
  • Freshness – Recent articles get a bonus since trends fade quickly over time.
  • Cross source validation – If the same topic shows up across multiple sources (like news and press releases), it’s likely important.
  • Engagement quality – Longer, more thoughtful comments add extra weight compared to short ones.
# Example of scoring logic
comment_score = 3 * min(article.get('comment_count', 0), 10)  
mention_score = 2 * min(article.get('comment_mentions', 0), 5)  
date_score = calculate_date_bonus(article.get('date'))  
source_score = 2 if len(article.get('sources', [])) > 1 else 0  
engagement_score = quality_from_comments(article.get('comments', []))  

trend_score = comment_score + mention_score + date_score + source_score + engagement_score
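
The two helpers used above aren't shown in the snippet, so here's a hedged sketch of what they might look like; the exact weights, cut-offs, and comment fields in Fab's project may differ:

from datetime import datetime

def calculate_date_bonus(date_str, max_bonus=5, decay_days=7):
    """Newer articles get a bonus that fades to zero after `decay_days`."""
    try:
        published = datetime.fromisoformat(date_str)
        if published.tzinfo:
            published = published.replace(tzinfo=None)
        age_days = (datetime.now() - published).days
    except (TypeError, ValueError):
        return 0
    return max(0.0, max_bonus * (1 - age_days / decay_days))

def quality_from_comments(comments, long_comment_chars=200, cap=5):
    """Reward longer, more substantive comments, up to a cap (assumes a 'text' field)."""
    long_comments = sum(1 for c in comments if len(c.get("text", "")) > long_comment_chars)
    return min(long_comments, cap)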

Each factor contributes points that add up to a final trend_score, showing how much traction an article has.

Here’s the flow:

  1. Link comments to articles – Attach every comment to its article.
self.article_comments[article_id].append(comment)
article['comment_count'] = len(self.article_comments[article_id])
  2. Score calculation - For every article, the calculator looks at the signals above and assigns points.
  3. Ranking - Articles are sorted by score so we can clearly see which ones are rising in popularity.
  4. Filtering - We keep only those above a threshold score (say 5 or 10), to cut out noise.

Finally, the output is saved as JSON for later use:

import json

# tc is the Trend Calculator instance built from the linked articles and comments
trending = tc.get_top_articles(threshold=5.0, limit=100)
with open('trending_articles.json', 'w') as f:
    json.dump(trending, f, indent=2)

At this point, we’re not just storing a list of articles; we’re turning them into insights about what’s gaining traction in real time.

The output of this step is a trending_articles.json file: a ranked list of articles with their comment signals attached. Next, we’ll take this list and extract the full article content for deeper processing.

Processing the Articles 📑

Alright, time to move past the signals and actually grab the articles' content. This is where Fab’s agent pulls in the full text so it can finally be read, summarized, and acted on. The real scraping and processing begins here.

Step 1: Smart Extraction with Zyte API
Instead of scraping blindly, we run each article URL through Zyte API. It tries multiple strategies under the hood:

  • Browser rendering for rich pages.
  • HTTP response fallback if the first pass fails.
  • And if all else fails → a graceful fallback object that notes the article couldn’t be extracted (paywalls, login walls, etc.).

Caching is baked in so we don’t re-download the same article twice.

def extract_article_with_zyte_api(url):
    if is_cached(url):
        return get_from_cache(url)
    # Try browser mode first, fallback to HTTP
    for method in [extract_with_browser_simple, extract_with_http_response]:
        article = method(url)
        if article and len(article.get("content", "")) > 100:
            save_to_cache(url, article)
            return article
    return create_fallback_article(url)
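
The browser and HTTP helpers are where Zyte API actually gets called. Here's a sketch of what they could look like with the python-zyte-api client; the request fields come from the Zyte API docs, everything else is illustrative:

import base64
from zyte_api import ZyteAPI

zyte_client = ZyteAPI(api_key="YOUR_ZYTE_API_KEY")   # placeholder key

def extract_with_browser_simple(url):
    """Browser-rendered fetch plus automatic article extraction."""
    result = zyte_client.get({"url": url, "browserHtml": True, "article": True})
    article = result.get("article") or {}
    return {
        "url": url,
        "title": article.get("headline", ""),
        "content": article.get("articleBody", ""),
    }

def extract_with_http_response(url):
    """Cheaper fallback: plain HTTP response body (returned base64-encoded)."""
    result = zyte_client.get({"url": url, "httpResponseBody": True})
    html = base64.b64decode(result["httpResponseBody"]).decode("utf-8", errors="ignore")
    return {"url": url, "title": "", "content": html}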

Step 2: Batch Processing ⚡

It's not a good idea to hammer a site with 50 requests at once, so Fab’s agent scrapes articles in small batches. This keeps things stable, avoids rate limits, and lets us resume midway if anything fails.

scraped_articles = process_articles_in_batches(urls, batch_size=3)
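
process_articles_in_batches itself can be as simple as slicing the URL list and pausing between slices. A minimal sketch, assuming the extract_article_with_zyte_api function from Step 1 and an illustrative delay:

import time

def process_articles_in_batches(urls, batch_size=3, delay_seconds=2):
    scraped = []
    for i in range(0, len(urls), batch_size):
        for url in urls[i:i + batch_size]:
            scraped.append(extract_article_with_zyte_api(url))
        time.sleep(delay_seconds)   # be polite between batches
    return scraped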

Step 3: Comments + Anonymization
Once the raw articles are in, we attach their associated comments (collected earlier) and anonymize usernames. That way, Fab can see the discussion signals without worrying about leaking personal data.

article['comments'] = matching_trending.get('comments', [])
article = processing_anonymizer.anonymize_comments_in_article(article)
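
The anonymizer boils down to swapping usernames for stable, non-reversible aliases. One possible approach (a sketch, not the project's exact rule set, and the 'username' field is an assumption):

import hashlib

def anonymize_comments_in_article(article):
    for comment in article.get("comments", []):
        username = comment.get("username", "")            # field name assumed
        alias = hashlib.sha256(username.encode()).hexdigest()[:8]
        comment["username"] = f"user_{alias}"
    return article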

Step 4: Summarization with LLMs

Finally, each article is summarized using Groq + Llama 3.3, with comments included in the context. The prompt ensures Fab gets:

  • A clear content-type tag ([Complete article with comments], [Partial article], etc.).
  • The main points of the article.
  • Highlights from user comments (agreements, debates, sentiment).
  • A note if the article looked incomplete or truncated.
summary = processor.summarize_article(article)
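
Under the hood, summarize_article is essentially a single chat-completion call. A hedged sketch using the Groq Python SDK, with a simplified version of the prompt described above:

from groq import Groq

groq_client = Groq(api_key="YOUR_GROQ_API_KEY")   # placeholder key

def summarize_article(article):
    comments_text = "\n".join(c.get("text", "") for c in article.get("comments", [])[:20])
    prompt = (
        "Summarize the article below. Start with a content-type tag, list the main points, "
        "highlight notable comment sentiment, and note if the text looks truncated.\n\n"
        f"ARTICLE:\n{article.get('content', '')[:8000]}\n\nCOMMENTS:\n{comments_text}"
    )
    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content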

At this point, we’ve gone from:
just links + scores → full articles + anonymization + structured summaries.

This is the real handoff moment: the dataset is now clean, safe, and AI ready. Time to combine this with other data sources...

Turning Raw Summaries into Something Useful

So we’ve got cleaned-up, summarized articles sitting neatly in JSON. That’s cool, but Fab doesn’t just want a folder full of summaries; he wants an agent that can reason over them, combine them with live market data, and give him answers on demand.

That’s exactly what Agno is for. Agno is a framework for building LLM-powered agents where everything revolves around tools. We use some ready-made tools, like yfinance for market data or DuckDuckGo for quick searches, and we’ll create our own custom tool from the scraped and summarized articles we’ve collected.

Step 1: Custom Data as a Tool

We wrap our summaries into a CustomDataTools class. This behaves just like any other tool in Fab’s agent, except instead of calling an external API, it pulls directly from our private dataset of scraped articles.

  • Load summaries from the article_summaries.json file.
  • Filter them by stock ticker (NVDA in our case).
  • Format them into a neat digest with truncation rules so we don’t blow past token limits.
class CustomDataTools(Toolkit):
    def get_custom_financial_summaries(self, stock_ticker="NVDA"):
        summaries = self.get_scraped_summaries()
        return self.format_summaries(summaries, stock_ticker=stock_ticker)
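
For a fuller picture, here's a hedged sketch of the whole toolkit, including the two helper methods referenced above; the file name, truncation limit, and register() wiring follow the usual Agno Toolkit pattern but may differ from Fab's exact code:

import json
from agno.tools import Toolkit

class CustomDataTools(Toolkit):
    def __init__(self, summaries_path="article_summaries.json"):
        super().__init__(name="custom_data_tools")
        self.summaries_path = summaries_path
        self.register(self.get_custom_financial_summaries)   # expose the tool to the agent

    def get_scraped_summaries(self):
        with open(self.summaries_path) as f:
            return json.load(f)

    def format_summaries(self, summaries, stock_ticker="NVDA", max_chars=2000):
        relevant = [s for s in summaries if stock_ticker.lower() in s.get("summary", "").lower()]
        digest = [f"- {s.get('title', 'Untitled')}: {s['summary'][:max_chars]}" for s in relevant]
        return "\n".join(digest) or f"No scraped summaries found for {stock_ticker}."

    def get_custom_financial_summaries(self, stock_ticker: str = "NVDA") -> str:
        """Return a digest of scraped article summaries for the given ticker."""
        summaries = self.get_scraped_summaries()
        return self.format_summaries(summaries, stock_ticker=stock_ticker)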

Step 2: Mixing with External Data

Of course, Fab doesn’t live on summaries alone. He still needs real-time signals like stock prices, analyst ratings, and fresh search results. That’s where we combine:

  • yfinance → live stock + fundamentals
  • DuckDuckGo → fresh search
  • Our custom summaries → curated, domain-specific insights

Now the agent has both breadth (search + finance APIs) and depth (our private dataset).

Step 3: Building the Agent 🧑‍💻
With Agno, stitching it together is dead simple:

  • Pick a model (Groq’s Llama 3.3 for speed, or Ollama locally if Fab prefers).
  • Load the toolset (custom data first, then finance APIs, then search).
  • Add guardrails: focus on NVDA, prefer bullet points, flag stale data, cite sources.
# 1️⃣ Import models and tools
from agno.agent import Agent
from agno.models.groq import Groq
from agno.models.ollama import Ollama
from agno.tools import Toolkit
from agno.tools.yfinance import YFinanceTools
from agno.tools.duckduckgo import DuckDuckGoTools

# 2️⃣ Define a custom tool for our scraped summaries ( given above ) 
class CustomDataTools(Toolkit):
    ...

# 3️⃣ Configure agent tools
def get_tools():
    return [
        CustomDataTools(),          # Priority: private scraped data
        YFinanceTools(),            # Priority: live stock & fundamentals
        DuckDuckGoTools()           # Priority: fresh search results
    ]

# 4️⃣ Pick a model
model = Groq(id="llama-3.3-70b-versatile")   # Or Ollama if you prefer local

# 5️⃣ Create the unified agent
finance_agent = Agent(
    name="Fab Finance Agent",
    model=model,
    tools=get_tools(),
    instructions=[
        "Focus on NVDA",
        "Prioritize custom summaries first, then live stock data, then fresh search",
        "Provide actionable insights in bullet points",
        "Cite sources and flag outdated info"
    ],
    show_tool_calls=True,
    markdown=True
)


Now when Fab asks:

“What’s the latest chatter around NVDA this week?”

The agent first checks our curated summaries, then layers in stock stats and fresh news.
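
In code, that's a single call (stream=True just prints tokens as they arrive):

finance_agent.print_response("What's the latest chatter around NVDA this week?", stream=True)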

This is where everything comes together:

  • Scrapy + Zyte API → fresh, structured raw data
  • Processing & scoring → signal + summaries
  • Finance Agent (Agno) → fusing custom + external tools into one workflow

Conclusion

What Fab ends up with is not just a scraper or a summarizer but a finance co-pilot that stays current, context-aware, and grounded in real web data.

With this workflow, what started as a manual, time-consuming task has transformed into a seamless, intelligent system, proving just how powerful AI Agents can be when paired with live web data.

💡 A small challenge for you:
If you’re feeling adventurous, try taking this project a step further: convert Fab’s AI Agent into a fully Agentic AI that can make decisions for you (of course, only with your approval, or you might risk your investments 😅). Connect it with your stockbroker’s MCP (many of them provide one nowadays) and scale it into something truly powerful: a next-level finance companion!

If you get stuck or need guidance, don’t worry. Head over to the Extract Data Community, where 21,000+ data enthusiasts are ready to jump in and help you with your questions.

Dive in, experiment, and show us your next move! 🙂

Thanks for reading!
