Kostas Pardalis

Posted on Nov 17 • Originally published at typedef.ai

How to Build a Deep Research Agent with Pydantic AI


"HN is amazing for discovery, terrible for structured research."

If you hang out on Hacker News, you know the feeling: you see a great thread, think "I should come back to this", and… never do. A week later, you're trying to answer a question like:

"How has HN's opinion on Rust vs Go changed over time?"
"What does HN actually think about LangChain-style agent frameworks?"

HN's built-in search is fine for keywords, but not for questions about themes, opinions, and trends.

What we really want is to ask higher-level questions about topics, threads, and time windows like:

"Show me discussions about Rust from the last 6 months."
"Compare how remote work was discussed in 2021 vs 2024."
"Summarize the main arguments for and against LLM agents across top HN threads."

That's where fenic comes in: think of it as a dataframe + context layer built for LLM-powered analysis. You declare what data you care about, use regular + semantic transforms to shape it, and then plug that into an agent loop.

This post walks through how we use fenic to turn a raw Hacker News dataset into a small but powerful "deep research" agent.

👉 Full project:

fenic repo: https://github.com/typedef-ai/fenic
Example code: https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent
HN dataset: https://huggingface.co/datasets/typedef-ai/hacker-news-dataset

2. What we'll build

We'll build a small research agent that:

Loads an HN dataset (stories, comments, metadata).
Lets you filter and slice discussions by topic, time, and signals.
Uses LLMs to summarize, compare, and extract themes from those slices.
Wraps it all in a simple loop: user question → fenic dataframe query → LLM analysis → answer + links back to HN.

Conceptually, the pipeline looks like this:

Data layer: fenic DataFrames over the HN dataset.
Query layer: reusable "research queries" expressed as fenic transformations.
LLM layer: fenic semantic operators (and/or UDFs) to summarize/compare.
Agent loop: something like PydanticAI or your framework of choice to orchestrate.
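To make the four layers concrete before bringing in fenic, here is a toy, standard-library-only sketch of the same loop. Everything in it is an assumption for illustration: the `Story` class, the made-up rows, and the `query_layer` / `llm_layer` / `answer` helpers are hypothetical stand-ins, and the "LLM" step is just a placeholder string.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Story:
    id: int
    title: str
    time: int  # Unix timestamp in seconds
    score: int


# Toy stand-in for the data layer (made-up rows, not taken from the dataset).
STORIES = [
    Story(1, "Rust in the Linux kernel", 1735776000, 420),      # 2025-01-02
    Story(2, "Why we moved from Go to Rust", 1704153600, 310),  # 2024-01-02
    Story(3, "Remote work is here to stay", 1738368000, 150),   # 2025-02-01
]


def query_layer(stories, keyword, since):
    """Filter by topic keyword and time window, highest score first."""
    cutoff = since.timestamp()
    hits = [
        s for s in stories
        if keyword.lower() in s.title.lower() and s.time >= cutoff
    ]
    return sorted(hits, key=lambda s: s.score, reverse=True)


def llm_layer(stories):
    """Placeholder for the LLM call: only reports what it was given."""
    if not stories:
        return "No matching stories."
    return f"{len(stories)} matching stories; top title: {stories[0].title!r}"


def answer(question, keyword, since):
    """Agent loop in miniature: question -> query -> analysis -> answer + links."""
    hits = query_layer(STORIES, keyword, since)
    links = [f"https://news.ycombinator.com/item?id={s.id}" for s in hits]
    return {"question": question, "analysis": llm_layer(hits), "links": links}


result = answer(
    "What does HN say about Rust?",
    keyword="rust",
    since=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(result)
```

The rest of the post replaces each placeholder with the real thing: fenic tables for the data, dataframe transformations for the query, semantic operators for the analysis.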

3. Setting up fenic

You'll need:

Python 3.10+
An LLM provider key (OpenAI, Anthropic, etc.)
uv package manager
git

Install with uv

# Clone the example repo
git clone https://github.com/typedef-ai/fenic-examples.git
cd fenic-examples/hn_agent

# Install dependencies with uv
uv sync

Set your LLM API key(s) and Hugging Face token:

export OPENAI_API_KEY="sk-..."
export HF_TOKEN="your-huggingface-token"

Inside this folder you'll find:

A notebook / script that wires together fenic + PydanticAI.
Helper functions to load the Hacker News dataset.
A simple agent loop that you can run locally.

If you just want to run it and poke around, start there. The rest of this post explains the pieces.

4. Loading the Hacker News data

The dataset we're using is published as a public Hugging Face Dataset:

👉 https://huggingface.co/datasets/typedef-ai/hacker-news-dataset

At a high level it contains:

Stories: id, title, url, by, time, score, etc.
Comments: id, parent, story_id, by, time, text
Metadata: type (story, comment), deleted flags, etc.

Here's how the actual data loader works in the hn_agent project:

from hn_agent.session import get_session

def load_hn_data(verbose: bool = True) -> None:
    """Load all Hacker News data from HuggingFace into local tables."""
    session = get_session()
    base_path = "hf://datasets/typedef-ai/hacker-news-dataset/data"

    # All 2025 data files to load
    files_to_tables = {
        "2025_comments": "comments",
        "2025_items": "items",
        "2025_stories": "stories",
        "2025_jobs": "jobs",
        "2025_polls": "polls",
        "2025_pollopts": "pollopts",
        "2025_users": "users",
        "2025_user_submissions": "user_submissions",
        "2025_item_children": "item_children",
        "2025_item_parts": "item_parts"
    }

    # Load each file into its own table
    for file_name, table_name in files_to_tables.items():
        if verbose:
            print(f"Loading {file_name}...")

        df = session.read.parquet(f"{base_path}/{file_name}.parquet")
        df.write.save_as_table(table_name, mode="overwrite")

        if verbose:
            count = df.count()
            print(f"  ✓ Loaded {count:,} records into {table_name}")

The session is configured with semantic support:

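from pathlib import Path

from fenic.api.session import Session
from fenic.api.session.config import SessionConfig, SemanticConfig, OpenAILanguageModel

# Assumed location for the local fenic database; the project defines its own data_dir.
data_dir = Path("data")
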
config = SessionConfig(
    app_name="hn_agent",
    db_path=str(data_dir),
    semantic=SemanticConfig(
        language_models={
            "gpt4": OpenAILanguageModel(
                model_name="gpt-5-nano",
                rpm=100,
                tpm=100000
            )
        }
    )
)
session = Session.get_or_create(config)

To run the data loading:

HF_TOKEN=$HF_TOKEN uv run python -m hn_agent.data.loader

This downloads ~2.5M comments and ~500K stories from 2025 into a local fenic database. The loader also denormalizes data into optimized lookup tables (comment_to_story, story_threads, story_discussions) that eliminate recursive SQL queries during tool execution.
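To see why a lookup table like comment_to_story removes the need for recursion at query time, consider a plain-Python sketch of the idea: walk each comment's parent chain once, up front, and store the root story ID. This is not the project's loader; the `build_comment_to_story` helper and the `(type, parent)` tuple layout are assumptions made for illustration.

```python
def build_comment_to_story(items):
    """Map every comment ID to the story ID at the root of its thread.

    `items` maps item ID -> (type, parent ID or None). Comments whose
    parent chain is broken (e.g. a missing ancestor) are left out.
    """
    resolved = {}

    def root_of(item_id):
        path = []
        current = item_id
        # Walk up until we reach a story, a cached answer, or a dead end.
        while True:
            if current in resolved:
                root = resolved[current]
                break
            if current is None or current not in items:
                root = None
                break
            kind, parent = items[current]
            if kind == "story":
                root = current
                break
            path.append(current)
            current = parent
        # Cache the answer for every comment visited on the way up.
        for comment_id in path:
            resolved[comment_id] = root
        return root

    for item_id, (kind, _parent) in items.items():
        if kind == "comment":
            root_of(item_id)
    return {cid: sid for cid, sid in resolved.items() if sid is not None}


items = {
    100: ("story", None),
    101: ("comment", 100),
    102: ("comment", 101),  # reply to a comment
    103: ("comment", 102),  # reply to a reply
    200: ("story", None),
    201: ("comment", 200),
    301: ("comment", 999),  # parent missing from the data
}
comment_to_story = build_comment_to_story(items)
print(comment_to_story)
```

Once this mapping exists as a table, "which story does this comment belong to?" becomes a single join instead of a walk up the comment tree.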

5. Defining research queries

Instead of hard-coding one-off scripts, we treat "research questions" as reusable dataframe transformations exposed as MCP tools.

Here's how the search tool is registered in the actual project:

import fenic.api.functions as fc
from fenic.core.types.datatypes import StringType
from fenic.core.mcp.types import ToolParam

def register_story_search_tool(session, tool_name: str = "search_stories") -> None:
    """Register a regex-based HN story search tool."""
    catalog = session.catalog

    # Get tables
    items = session.table("items").filter(fc.col("type") == fc.lit("story"))
    comments = session.table("comments")
    comment_to_story = session.table("comment_to_story")  # Denormalized table

    # Tool parameters
    pattern = fc.tool_param("pattern", StringType)

    # Story-side matches
    title_match = fc.coalesce(fc.col("title"), fc.lit("")).rlike(pattern)
    url_match = fc.coalesce(fc.col("url"), fc.lit("")).rlike(pattern)
    story_text_match = fc.coalesce(fc.col("text"), fc.lit("")).rlike(pattern)

    story_hits = (
        items.with_column("title_match", title_match)
        .with_column("url_match", url_match)
        .with_column("text_match", story_text_match)
        .with_column(
            "match_rank",
            fc.when(fc.col("title_match"), fc.lit(1))
            .when(fc.col("url_match"), fc.lit(2))
            .when(fc.col("text_match"), fc.lit(3))
            .otherwise(fc.lit(999)),
        )
        .filter(fc.col("title_match") | fc.col("url_match") | fc.col("text_match"))
        .select(
            fc.col("id").alias("story_id"),
            fc.col("title"),
            fc.col("by").alias("author"),
            fc.col("time").alias("published_at"),
            fc.col("score"),
            fc.col("descendants").alias("comment_count"),
            fc.col("url"),
            fc.col("match_rank"),
        )
    )

    # Comment-side matches - use denormalized lookup table (no recursion!)
    comment_text_match = fc.coalesce(fc.col("text"), fc.lit("")).rlike(pattern)
    matched_comments = (
        comments
        .filter(comment_text_match)
        .select(fc.col("id").alias("comment_id"))
        .limit(5000)
    )

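    # Fast lookup using denormalized comment_to_story table
    comment_stories = (
        matched_comments
        .join(comment_to_story, on="comment_id")
        .select(fc.col("story_id"))
        .drop_duplicates(["story_id"])
    )

    # Attach story details to comment matches so both sides of the union share
    # a schema; comment-only matches rank last (4). This step is a sketch: the
    # union below needs a `comment_hits` frame, and the project may build it differently.
    comment_hits = (
        comment_stories
        .join(items, left_on=fc.col("story_id"), right_on=fc.col("id"))
        .select(
            fc.col("story_id"),
            fc.col("title"),
            fc.col("by").alias("author"),
            fc.col("time").alias("published_at"),
            fc.col("score"),
            fc.col("descendants").alias("comment_count"),
            fc.col("url"),
            fc.lit(4).alias("match_rank"),
        )
    )

    # Combine and rank results
    unified = story_hits.union(comment_hits).drop_duplicates(["story_id"])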
    sorted_results = unified.sort([fc.col("match_rank"), fc.col("published_at").desc()])

    # Register the tool
    catalog.create_tool(
        tool_name=tool_name,
        tool_description="Search Hacker News stories using regex patterns...",
        tool_query=sorted_results,
        tool_params=[
            ToolParam(name="pattern", description="Regular expression pattern to search for.")
        ],
        result_limit=100,
    )

Results are ranked by relevance:

Title matches (most relevant)
URL matches
Story text matches
Comment matches
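The ranking rule is easy to check in isolation. Below is a minimal pure-Python sketch of the same ordering using the standard `re` module, assuming the pattern may match anywhere in a field. The `match_rank` helper and the sample rows are hypothetical and only mirror the logic of the tool above, not its implementation.

```python
import re


def match_rank(pattern, story, comment_texts=()):
    """Return 1 (title), 2 (url), 3 (story text), 4 (comments), or None."""
    rx = re.compile(pattern)
    for rank, field in enumerate(("title", "url", "text"), start=1):
        # Missing fields are treated as empty strings, like the coalesce calls.
        if rx.search(story.get(field) or ""):
            return rank
    if any(rx.search(text or "") for text in comment_texts):
        return 4
    return None


stories = [
    {"id": 4, "title": "Weekend project thread", "url": None, "text": None, "time": 40},
    {"id": 3, "title": "Ask HN: Favorite language?", "url": None,
     "text": "I keep coming back to Rust.", "time": 20},
    {"id": 2, "title": "New systems language benchmarks",
     "url": "https://example.com/rust-vs-go", "text": None, "time": 30},
    {"id": 1, "title": "Show HN: A Rust web framework",
     "url": "https://example.com/a", "text": None, "time": 10},
    {"id": 5, "title": "Gardening tips", "url": None, "text": None, "time": 50},
]
comments = {4: ["I rewrote mine in Rust"], 5: ["Tomatoes need sun"]}

pattern = r"(?i)\brust\b"
ranked = []
for story in stories:
    rank = match_rank(pattern, story, comments.get(story["id"], ()))
    if rank is not None:
        ranked.append((rank, -story["time"], story["id"]))

# Best rank first; newer stories first within the same rank.
ordered_ids = [story_id for _rank, _neg_time, story_id in sorted(ranked)]
print(ordered_ids)
```

A story is assigned the rank of its best-matching field, so a title hit always outranks a story that only matches in its comments.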

6. Adding LLM-powered analysis

fenic has first-class semantic operators that wrap LLM calls as dataframe operations (with batching, retries, cost tracking, etc.). That lets you say:

"For each group of comments, ask the model to summarize / classify / extract structure."

Summarize threads with structured output

Here's how the summarize_story tool works with Pydantic models:

from pydantic import BaseModel, Field
from typing import List
import fenic.api.functions.semantic as semantic

class DiscussionTheme(BaseModel):
    """Represents a theme or topic within a discussion."""
    topic: str = Field(description="Name of the discussion theme")
    summary: str = Field(description="Concise summary of the theme, viewpoints, and evidence")
    stance_spectrum: str = Field(default="", description="How opinions vary across this theme")
    representative_comment_ids: List[int] = Field(
        default_factory=list, description="Example comment IDs relevant to this theme"
    )
    off_topic: bool = Field(default=False, description="True if this theme is off-topic")


class StorySummary(BaseModel):
    """Structured summary of a Hacker News story and its discussion."""
    tl_dr: str = Field(description="Two-sentence top summary")
    story_overview: str = Field(description="Short overview of the story itself")
    key_points: List[str] = Field(default_factory=list, description="Key points and takeaways")
    discussion_themes: List[DiscussionTheme] = Field(
        default_factory=list, description="Themes across the discussion"
    )
    variety_present: bool = Field(description="Whether discussion splits into distinct topics")
    off_topic_themes: List[str] = Field(default_factory=list, description="Names of off-topic themes")
    risks_or_concerns: List[str] = Field(default_factory=list, description="Risks or concerns raised")
    actionables: List[str] = Field(default_factory=list, description="Any concrete action items")
    sources: List[int] = Field(default_factory=list, description="Referenced comment IDs")
    truncated_input: bool = Field(description="True if input was truncated due to size")

The summarization uses fenic's semantic.map with a structured prompt:

CONCISE_PROMPT = """Summarize this Hacker News discussion in {{ language }}:

Story: {{ title }} ({{ domain }})
URL: {{ url }}
Score: {{ score }}, Comments: {{ descendants }}
Published: {{ published_at }}

Discussion thread:
{{ transcript }}

Create a structured summary including:
1. TL;DR (max 2 sentences)
2. Story overview (brief)
3. Key points from discussion
4. Main discussion themes with viewpoints and stances
5. Off-topic themes if present
6. Risks/concerns raised
7. Action items mentioned

{{ extra_instructions }}"""

summary_col = semantic.map(
    CONCISE_PROMPT,
    response_format=StorySummary,
    model_alias=model_alias,
    temperature=0.0,
    max_output_tokens=768,
    title=fc.col("title"),
    url=fc.col("url"),
    domain=fc.col("domain"),
    published_at=fc.col("published_at"),
    score=fc.col("score"),
    descendants=fc.col("descendants"),
    transcript=fc.col("transcript_limited"),
    extra_instructions=fc.col("extra"),
    language=fc.col("lang"),
)

with_summary = with_discussion.select(
    fc.col("story_id"),
    fc.col("title"),
    # ... other columns
    summary_col.alias("summary"),
)

Now each row has a typed summary object you can:

Access nested fields directly
Aggregate stance ratios by year
Join back to scores, authors, etc.
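As an illustration of working with the nested fields, here is a small sketch that counts how often each discussion theme appears across several summaries. It assumes the rows have already been collected into plain Python dicts shaped like `StorySummary` (for example via Pydantic's `model_dump()`); the sample data and the `theme_counts` helper are made up.

```python
from collections import Counter

# Made-up rows shaped like the StorySummary model (only a few fields shown).
summaries = [
    {
        "story_id": 101,
        "summary": {
            "tl_dr": "Rust adoption keeps growing.",
            "discussion_themes": [
                {"topic": "memory safety", "off_topic": False},
                {"topic": "compile times", "off_topic": False},
            ],
        },
    },
    {
        "story_id": 102,
        "summary": {
            "tl_dr": "A Go team reports on a rewrite.",
            "discussion_themes": [
                {"topic": "memory safety", "off_topic": False},
                {"topic": "hiring market", "off_topic": True},
            ],
        },
    },
]


def theme_counts(rows, include_off_topic=False):
    """Count how many stories mention each theme, skipping off-topic ones by default."""
    counts = Counter()
    for row in rows:
        for theme in row["summary"]["discussion_themes"]:
            if theme["off_topic"] and not include_off_topic:
                continue
            counts[theme["topic"]] += 1
    return counts


print(theme_counts(summaries).most_common())
```

Because the schema is fixed by the Pydantic models, this kind of downstream code does not need to guess at field names or parse free-form text.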

7. Putting it together as an agent loop

fenic takes care of data + context. To make this interactive, we wrap it in a small agent loop using PydanticAI.

Here's the actual research agent from the project:

import os

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from typing import Any, Dict, List

class DeepResearchReport(BaseModel):
    """Structured output for research findings."""
    question: str
    method: List[str] = Field(default_factory=list, description="Methods used to research")
    key_findings: List[str] = Field(default_factory=list, description="Main discoveries")
    themes: List[Dict[str, Any]] = Field(default_factory=list, description="Common themes across stories")
    controversies: List[str] = Field(default_factory=list, description="Points of disagreement")
    sources: List[Dict[str, Any]] = Field(default_factory=list, description="Story IDs and titles")
    limitations: List[str] = Field(default_factory=list, description="Research limitations")


SYSTEM_PROMPT = """
You are a deep research agent analyzing Hacker News discussions via MCP tools.

Available tools:
- search_stories(pattern): Find stories matching a regex pattern
- summarize_story(story_id): Get AI summary of a story and its discussion
- read_story(story_id): Get full story with comment tree (use sparingly)

Research process:
1. Use search_stories to find relevant content (max 5 searches, limit 10 per search)
2. Use summarize_story on the most relevant stories
3. Only use read_story if you need specific metadata not in summaries
4. Synthesize findings across all stories

Important:
- Keep search patterns broad initially, then refine
- Always cite story IDs in your findings
- Don't paste raw tool outputs into context
- Focus on patterns and insights across multiple stories

Return a JSON object matching the DeepResearchReport schema.
"""


async def run_research_async(question: str, max_stories_to_summarize: int = 8) -> DeepResearchReport:
    """Run deep research on a Hacker News topic."""
    mcp_url = os.getenv("HN_MCP_URL", "http://localhost:8080/mcp")

    # Create MCP connection
    mcp_server = MCPServerStreamableHTTP(url=mcp_url)

    # Create agent with structured output
    agent = Agent(
        "openai:gpt-5",
        system_prompt=SYSTEM_PROMPT,
        toolsets=[mcp_server],
        output_type=DeepResearchReport,
        output_retries=2
    )

    # Build user prompt
    user_prompt = f"""Research question: {question}

Please investigate this topic across Hacker News stories and discussions.
Budget: max 5 searches, summarize up to {max_stories_to_summarize} stories.
Focus on finding diverse perspectives and recurring themes."""

    result = await agent.run(user_prompt)
    return result.output
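Note that the search and summary budgets above live only in the prompt text, so the model is asked, not forced, to respect them. If you want a hard cap, one option is a small counter that your own tool-calling code consults before each call. The sketch below is an assumption, not part of the project: `ToolBudget` is a hypothetical helper, and where to hook it in depends on your agent framework.

```python
class ToolBudget:
    """Counts tool calls and refuses any call beyond its per-tool limit."""

    def __init__(self, **limits: int) -> None:
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def try_spend(self, tool_name: str) -> bool:
        """Record one call; return False if the tool's budget is exhausted."""
        limit = self.limits.get(tool_name)
        if limit is None:
            return True  # tools without a configured limit are unrestricted
        if self.used[tool_name] >= limit:
            return False
        self.used[tool_name] += 1
        return True


# Same numbers as the prompt: 5 searches, 8 summaries.
budget = ToolBudget(search_stories=5, summarize_story=8)
results = [budget.try_spend("search_stories") for _ in range(7)]
print(results)  # the first five calls are allowed, the last two refused
```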

The MCP server that exposes the tools is simple:

from fenic.api.mcp.server import create_mcp_server, run_mcp_server_sync
from hn_agent.session import get_session
from hn_agent.tools.tools import register_tools


def start_server(port: int = 8080) -> None:
    """Start the HTTP MCP server."""
    session = get_session()

    # Register tools first with the same session
    register_tools(session=session)

    # Get all tools from catalog
    catalog = session.catalog
    tools = catalog.list_tools()

    # Create the MCP server with tools
    server = create_mcp_server(
        session=session,
        server_name="hn_agent",
        tools=tools
    )

    # Run the server with HTTP transport
    print(f"Starting MCP server on http://localhost:{port}")
    run_mcp_server_sync(
        server=server,
        transport="http",
        port=port
    )

To run everything:

# Terminal 1: Start MCP server
uv run python -m hn_agent.mcp.server

# Terminal 2: Run research queries
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli "What are concerns about AI safety?"

# With options
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli --max-stories 10 "Latest LLM developments"

# Output as JSON
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli --json "Rust vs Go discussions"

Whatever the question, the pattern stays the same:

Dataframe slice → semantic transforms → agent consumes the results.
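The CLI invocations above take a question plus two optional flags, `--max-stories` and `--json`. For readers wiring up something similar, here is a minimal `argparse` sketch of that interface. It is a guess at the shape, not the project's actual CLI code; the help strings are assumptions, and the default of 8 is borrowed from `max_stories_to_summarize` in `run_research_async`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(
        prog="hn_agent.cli",
        description="Ask a research question about Hacker News discussions.",
    )
    parser.add_argument("question", help="Research question to investigate")
    parser.add_argument(
        "--max-stories",
        type=int,
        default=8,
        help="Maximum number of stories to summarize",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Print the report as JSON instead of formatted text",
    )
    return parser


args = build_parser().parse_args(["--max-stories", "10", "Latest LLM developments"])
print(args.question, args.max_stories, args.json)
```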

8. Where to go next

Everything here is just one concrete instantiation of a more general pattern:

Swap in your own datasets
- Company forum threads
- Support tickets
- Slack exports
- Internal RFCs / design docs
Reuse the same fenic primitives
- Filter/slice on metadata (teams, product areas, time windows).
- Use semantic.map with Pydantic models for structured extraction.
- Use semantic.extract for pulling typed data from text.
- Use fc.tool_param to create parameterized MCP tools.
Combine with other fenic examples
- Use the semantic join examples to correlate HN threads with logs, incidents, or docs.
- Use the clustering capabilities to group similar discussions together.
- Use the Hugging Face Datasets integration to hydrate other versioned datasets into DataFrames with one line.