"HN is amazing for discovery, terrible for structured research."
If you hang out on Hacker News, you know the feeling: you see a great thread, think "I should come back to this", and… never do. A week later, you're trying to answer a question like:
- "How has HN's opinion on Rust vs Go changed over time?"
- "What does HN actually think about LangChain-style agent frameworks?"
HN's built-in search is fine for keywords, but not for questions about themes, opinions, and trends.
What we really want is to ask higher-level questions about topics, threads, and time windows like:
- "Show me discussions about, e.g., Rust in the last 6 months."
- "Compare how remote work was discussed in 2021 vs 2024."
- "Summarize the main arguments for and against LLM agents across top HN threads."
That's where fenic comes in: think of it as a dataframe + context layer built for LLM-powered analysis. You declare what data you care about, use regular + semantic transforms to shape it, and then plug that into an agent loop.
This post walks through how we use fenic to turn a raw Hacker News dataset into a small but powerful "deep research" agent.
👉 Full project:
- fenic repo: https://github.com/typedef-ai/fenic
- Example code: https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent
- HN dataset: https://huggingface.co/datasets/typedef-ai/hacker-news-dataset
2. What we'll build
We'll build a small research agent that:
- Loads an HN dataset (stories, comments, metadata).
- Lets you filter and slice discussions by topic, time, and signals.
- Uses LLMs to summarize, compare, and extract themes from those slices.
- Wraps it all in a simple loop: user question → fenic dataframe query → LLM analysis → answer + links back to HN.
Conceptually, the pipeline looks like this:
- Data layer: fenic DataFrames over the HN dataset.
- Query layer: reusable "research queries" expressed as fenic transformations.
- LLM layer: fenic semantic operators (and/or UDFs) to summarize/compare.
- Agent loop: something like PydanticAI or your framework of choice to orchestrate.
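The flow through these layers can be sketched in plain Python before any fenic code appears. This is a toy under stated assumptions, not the project's implementation: `answer_question`, `keyword_query`, and `fake_llm` are hypothetical names, the "query layer" is a keyword match over an in-memory list, and the "LLM layer" is a stub that returns a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    links: list[str]


def answer_question(
    question: str,
    query: Callable[[str], list[dict]],
    analyze: Callable[[str, list[dict]], str],
) -> Answer:
    """Run one pass: question -> query -> analysis -> answer with HN links."""
    rows = query(question)          # data + query layers
    text = analyze(question, rows)  # LLM layer (stubbed in this sketch)
    links = [f"https://news.ycombinator.com/item?id={row['id']}" for row in rows]
    return Answer(text=text, links=links)


# Toy stand-ins so the sketch runs without any external service.
STORIES = [
    {"id": 1, "title": "Rust vs Go for backend services"},
    {"id": 2, "title": "Show HN: a tiny Go web framework"},
    {"id": 3, "title": "Remote work in 2024"},
]


def keyword_query(question: str) -> list[dict]:
    """Return stories whose title shares at least one word with the question."""
    words = {w.lower().strip("?.,") for w in question.split()}
    return [s for s in STORIES if words & {w.lower() for w in s["title"].split()}]


def fake_llm(question: str, rows: list[dict]) -> str:
    """Placeholder for a real model call."""
    return f"{len(rows)} matching stories for: {question}"


result = answer_question("What about Go?", keyword_query, fake_llm)
```

Passing the query and analysis steps in as functions keeps the loop itself unchanged when the stubs are replaced by real dataframe queries and model calls.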
3. Setting up fenic
You'll need:
- Python 3.10+
- An LLM provider key (OpenAI, Anthropic, etc.)
- uv package manager
- git
Install with uv
# Clone the example repo
git clone https://github.com/typedef-ai/fenic-examples.git
cd fenic-examples/hn_agent
# Install dependencies with uv
uv sync
Set your LLM API key(s) and HuggingFace token:
export OPENAI_API_KEY="sk-..."
export HF_TOKEN="your-huggingface-token"
Inside this folder you'll find:
- A notebook / script that wires together fenic + PydanticAI.
- Helper functions to load the Hacker News dataset.
- A simple agent loop that you can run locally.
If you just want to run it and poke around, start there. The rest of this post explains the pieces.
4. Loading the Hacker News data
The dataset we're using is published as a public Hugging Face Dataset:
👉 https://huggingface.co/datasets/typedef-ai/hacker-news-dataset
At a high level it contains:
- Stories: id, title, url, by, time, score, etc.
- Comments: id, parent, story_id, by, time, text
- Metadata: type (story, comment), deleted flags, etc.
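To make the shape of those records concrete, here is a minimal sketch of the two main record types as Python dataclasses. The field names come from the list above; the Python types, and the assumption that `time` is a Unix timestamp in seconds (as in the official HN API), are guesses that should be checked against the dataset itself. `posted_at` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Story:
    id: int
    title: str
    url: Optional[str]  # text posts such as Ask HN have no URL
    by: str
    time: int           # assumed: Unix seconds
    score: int


@dataclass(frozen=True)
class Comment:
    id: int
    parent: int         # id of the parent comment or story
    story_id: int       # id of the story the thread belongs to
    by: str
    time: int           # assumed: Unix seconds
    text: str


def posted_at(unix_seconds: int) -> datetime:
    """Convert an assumed Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
```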
Here's how the actual data loader works in the hn_agent project:
from hn_agent.session import get_session
def load_hn_data(verbose: bool = True) -> None:
    """Load all Hacker News data from HuggingFace into local tables."""
    session = get_session()
    base_path = "hf://datasets/typedef-ai/hacker-news-dataset/data"
    # All 2025 data files to load
    files_to_tables = {
        "2025_comments": "comments",
        "2025_items": "items",
        "2025_stories": "stories",
        "2025_jobs": "jobs",
        "2025_polls": "polls",
        "2025_pollopts": "pollopts",
        "2025_users": "users",
        "2025_user_submissions": "user_submissions",
        "2025_item_children": "item_children",
        "2025_item_parts": "item_parts"
    }
    # Load each file into its own table
    for file_name, table_name in files_to_tables.items():
        if verbose:
            print(f"Loading {file_name}...")
        df = session.read.parquet(f"{base_path}/{file_name}.parquet")
        df.write.save_as_table(table_name, mode="overwrite")
        if verbose:
            count = df.count()
            print(f"  ✓ Loaded {count:,} records into {table_name}")
The session is configured with semantic support:
from fenic.api.session import Session
from fenic.api.session.config import SessionConfig, SemanticConfig, OpenAILanguageModel
config = SessionConfig(
    app_name="hn_agent",
    db_path=str(data_dir),
    semantic=SemanticConfig(
        language_models={
            "gpt4": OpenAILanguageModel(
                model_name="gpt-5-nano",
                rpm=100,
                tpm=100000
            )
        }
    )
)
session = Session.get_or_create(config)
To run the data loading:
HF_TOKEN=$HF_TOKEN uv run python -m hn_agent.data.loader
This downloads ~2.5M comments and ~500K stories from 2025 into a local fenic database. The loader also denormalizes data into optimized lookup tables (comment_to_story, story_threads, story_discussions) that eliminate recursive SQL queries during tool execution.
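The idea behind a lookup table like comment_to_story can be shown in plain Python: resolve each comment's root story once, up front, so that later queries are a single join rather than a walk up the reply chain. This is an illustrative sketch, not the loader's actual code; `build_comment_to_story` is a hypothetical name, and it assumes parent links contain no cycles.

```python
def build_comment_to_story(
    parent_of: dict[int, int], story_ids: set[int]
) -> dict[int, int]:
    """Map every comment id to the id of the story at the root of its thread.

    parent_of maps a comment id to its parent id (a comment or a story).
    Comments whose chain never reaches a known story are left out.
    """
    root: dict[int, int] = {}
    for comment_id in parent_of:
        path = []
        node = comment_id
        # Climb until we hit a story, an already-resolved comment, or a dead end.
        while node not in story_ids and node not in root and node in parent_of:
            path.append(node)
            node = parent_of[node]
        if node in story_ids:
            story = node
        elif node in root:
            story = root[node]
        else:
            continue  # orphaned chain: the parent is missing from the data
        # Cache the answer for every comment visited on the way up.
        for visited in path:
            root[visited] = story
    return root


# Thread: story 1 <- 10 <- 11 <- 12, story 2 <- 20, and 30 points at a missing parent.
lookup = build_comment_to_story(
    {10: 1, 11: 10, 12: 11, 20: 2, 30: 99}, story_ids={1, 2}
)
```

Because resolved comments are cached, each comment is visited a bounded number of times regardless of how deep the thread is.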
5. Defining research queries
Instead of hard-coding one-off scripts, we treat "research questions" as reusable dataframe transformations exposed as MCP tools.
Here's how the search tool is registered in the actual project:
import fenic.api.functions as fc
from fenic.core.types.datatypes import StringType
from fenic.core.mcp.types import ToolParam
def register_story_search_tool(session, tool_name: str = "search_stories") -> None:
    """Register a regex-based HN story search tool."""
    catalog = session.catalog
    # Get tables
    items = session.table("items").filter(fc.col("type") == fc.lit("story"))
    comments = session.table("comments")
    comment_to_story = session.table("comment_to_story")  # Denormalized table
    # Tool parameters
    pattern = fc.tool_param("pattern", StringType)
    # Story-side matches
    title_match = fc.coalesce(fc.col("title"), fc.lit("")).rlike(pattern)
    url_match = fc.coalesce(fc.col("url"), fc.lit("")).rlike(pattern)
    story_text_match = fc.coalesce(fc.col("text"), fc.lit("")).rlike(pattern)
    story_hits = (
        items.with_column("title_match", title_match)
        .with_column("url_match", url_match)
        .with_column("text_match", story_text_match)
        .with_column(
            "match_rank",
            fc.when(fc.col("title_match"), fc.lit(1))
            .when(fc.col("url_match"), fc.lit(2))
            .when(fc.col("text_match"), fc.lit(3))
            .otherwise(fc.lit(999)),
        )
        .filter(fc.col("title_match") | fc.col("url_match") | fc.col("text_match"))
        .select(
            fc.col("id").alias("story_id"),
            fc.col("title"),
            fc.col("by").alias("author"),
            fc.col("time").alias("published_at"),
            fc.col("score"),
            fc.col("descendants").alias("comment_count"),
            fc.col("url"),
            fc.col("match_rank"),
        )
    )
    # Comment-side matches - use denormalized lookup table (no recursion!)
    comment_text_match = fc.coalesce(fc.col("text"), fc.lit("")).rlike(pattern)
    matched_comments = (
        comments
        .filter(comment_text_match)
        .select(fc.col("id").alias("comment_id"))
        .limit(5000)
    )
    # Fast lookup using denormalized comment_to_story table
    comment_stories = (
        matched_comments
        .join(comment_to_story, on="comment_id")
        .select(fc.col("story_id"))
        .drop_duplicates(["story_id"])
    )
    # NOTE: this excerpt referenced `comment_hits` without defining it. The block
    # below is a reconstruction (an assumption, not the project's code): it joins
    # the matched story IDs back to `items` and assigns rank 4 so that both sides
    # of the union share the same columns.
    comment_hits = (
        items.with_column("story_id", fc.col("id"))
        .join(comment_stories, on="story_id")
        .select(
            fc.col("story_id"),
            fc.col("title"),
            fc.col("by").alias("author"),
            fc.col("time").alias("published_at"),
            fc.col("score"),
            fc.col("descendants").alias("comment_count"),
            fc.col("url"),
            fc.lit(4).alias("match_rank"),
        )
    )
    # Combine and rank results
    unified = story_hits.union(comment_hits).drop_duplicates(["story_id"])
    sorted_results = unified.sort([fc.col("match_rank"), fc.col("published_at").desc()])
    # Register the tool
    catalog.create_tool(
        tool_name=tool_name,
        tool_description="Search Hacker News stories using regex patterns...",
        tool_query=sorted_results,
        tool_params=[
            ToolParam(name="pattern", description="Regular expression pattern to search for.")
        ],
        result_limit=100,
    )
Results are ranked by relevance:
- Title matches (most relevant)
- URL matches
- Story text matches
- Comment matches
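The ranking rule is easy to check in isolation. Below is a standard-library sketch of the same ordering: the first field that matches decides the rank, and ties are broken by recency. The function names and the rank value 4 for comment-only matches are assumptions made for illustration, and `re.search` stands in for the dataframe `rlike` call, which is assumed here to match anywhere in the string.

```python
import re
from typing import Optional


def match_rank(story: dict, pattern: str, comment_story_ids: set[int]) -> Optional[int]:
    """Return 1 (title), 2 (URL), 3 (story text), 4 (comments only), or None."""
    rx = re.compile(pattern)
    for rank, field in enumerate(("title", "url", "text"), start=1):
        # Missing fields are treated as empty strings, like coalesce(col, "").
        if rx.search(story.get(field) or ""):
            return rank
    if story["id"] in comment_story_ids:
        return 4  # assumed rank for stories matched only through their comments
    return None


def rank_stories(
    stories: list[dict], pattern: str, comment_story_ids: set[int]
) -> list[int]:
    """Return matching story ids, best rank first and newest first within a rank."""
    ranked = []
    for story in stories:
        rank = match_rank(story, pattern, comment_story_ids)
        if rank is not None:
            ranked.append((rank, story))
    ranked.sort(key=lambda pair: (pair[0], -pair[1]["time"]))
    return [story["id"] for _, story in ranked]
```

A case-insensitive search is expressed in the pattern itself, for example `(?i)rust`.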
6. Adding LLM-powered analysis
fenic has first-class semantic operators that wrap LLM calls as dataframe operations (with batching, retries, cost tracking, etc.). That lets you say:
"For each group of comments, ask the model to summarize / classify / extract structure."
Summarize threads with structured output
Here's how the summarize_story tool works with Pydantic models:
from pydantic import BaseModel, Field
from typing import List
import fenic.api.functions.semantic as semantic
class DiscussionTheme(BaseModel):
    """Represents a theme or topic within a discussion."""
    topic: str = Field(description="Name of the discussion theme")
    summary: str = Field(description="Concise summary of the theme, viewpoints, and evidence")
    stance_spectrum: str = Field(default="", description="How opinions vary across this theme")
    representative_comment_ids: List[int] = Field(
        default_factory=list, description="Example comment IDs relevant to this theme"
    )
    off_topic: bool = Field(default=False, description="True if this theme is off-topic")
class StorySummary(BaseModel):
    """Structured summary of a Hacker News story and its discussion."""
    tl_dr: str = Field(description="Two-sentence top summary")
    story_overview: str = Field(description="Short overview of the story itself")
    key_points: List[str] = Field(default_factory=list, description="Key points and takeaways")
    discussion_themes: List[DiscussionTheme] = Field(
        default_factory=list, description="Themes across the discussion"
    )
    variety_present: bool = Field(description="Whether discussion splits into distinct topics")
    off_topic_themes: List[str] = Field(default_factory=list, description="Names of off-topic themes")
    risks_or_concerns: List[str] = Field(default_factory=list, description="Risks or concerns raised")
    actionables: List[str] = Field(default_factory=list, description="Any concrete action items")
    sources: List[int] = Field(default_factory=list, description="Referenced comment IDs")
    truncated_input: bool = Field(description="True if input was truncated due to size")
The summarization uses fenic's semantic.map with a structured prompt:
CONCISE_PROMPT = """Summarize this Hacker News discussion in {{ language }}:
Story: {{ title }} ({{ domain }})
URL: {{ url }}
Score: {{ score }}, Comments: {{ descendants }}
Published: {{ published_at }}
Discussion thread:
{{ transcript }}
Create a structured summary including:
1. TL;DR (max 2 sentences)
2. Story overview (brief)
3. Key points from discussion
4. Main discussion themes with viewpoints and stances
5. Off-topic themes if present
6. Risks/concerns raised
7. Action items mentioned
{{ extra_instructions }}"""
summary_col = semantic.map(
    CONCISE_PROMPT,
    response_format=StorySummary,
    model_alias=model_alias,
    temperature=0.0,
    max_output_tokens=768,
    title=fc.col("title"),
    url=fc.col("url"),
    domain=fc.col("domain"),
    published_at=fc.col("published_at"),
    score=fc.col("score"),
    descendants=fc.col("descendants"),
    transcript=fc.col("transcript_limited"),
    extra_instructions=fc.col("extra"),
    language=fc.col("lang"),
)
with_summary = with_discussion.select(
    fc.col("story_id"),
    fc.col("title"),
    # ... other columns
    summary_col.alias("summary"),
)
Now each row has a typed summary object you can:
- Access nested fields directly
- Aggregate stance ratios by year
- Join back to scores, authors, etc.
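As a small illustration of working with that structure downstream, the sketch below counts how often each on-topic theme appears across several summaries. It is an assumption-laden toy: plain dictionaries shaped like StorySummary stand in for the typed objects so it runs with the standard library alone, the sample data is made up, and `theme_counts` is a hypothetical helper.

```python
from collections import Counter


def theme_counts(summaries: list[dict]) -> Counter:
    """Count on-topic discussion themes across a list of summary dicts."""
    counts: Counter = Counter()
    for summary in summaries:
        for theme in summary.get("discussion_themes", []):
            if not theme.get("off_topic", False):
                counts[theme["topic"]] += 1
    return counts


# Made-up sample data that mirrors the nested StorySummary layout.
sample_summaries = [
    {
        "story_id": 1,
        "discussion_themes": [
            {"topic": "Memory safety", "off_topic": False},
            {"topic": "Hiring market", "off_topic": True},
        ],
    },
    {
        "story_id": 2,
        "discussion_themes": [
            {"topic": "Memory safety", "off_topic": False},
            {"topic": "Compile times", "off_topic": False},
        ],
    },
]

counts = theme_counts(sample_summaries)
```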
7. Putting it together as an agent loop
fenic takes care of data + context. To make this interactive, we wrap it in a small agent loop using PydanticAI.
Here's the actual research agent from the project:
import os
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from typing import Any, Dict, List
class DeepResearchReport(BaseModel):
    """Structured output for research findings."""
    question: str
    method: List[str] = Field(default_factory=list, description="Methods used to research")
    key_findings: List[str] = Field(default_factory=list, description="Main discoveries")
    themes: List[Dict[str, Any]] = Field(default_factory=list, description="Common themes across stories")
    controversies: List[str] = Field(default_factory=list, description="Points of disagreement")
    sources: List[Dict[str, Any]] = Field(default_factory=list, description="Story IDs and titles")
    limitations: List[str] = Field(default_factory=list, description="Research limitations")
SYSTEM_PROMPT = """
You are a deep research agent analyzing Hacker News discussions via MCP tools.
Available tools:
- search_stories(pattern): Find stories matching a regex pattern
- summarize_story(story_id): Get AI summary of a story and its discussion
- read_story(story_id): Get full story with comment tree (use sparingly)
Research process:
1. Use search_stories to find relevant content (max 5 searches, limit 10 per search)
2. Use summarize_story on the most relevant stories
3. Only use read_story if you need specific metadata not in summaries
4. Synthesize findings across all stories
Important:
- Keep search patterns broad initially, then refine
- Always cite story IDs in your findings
- Don't paste raw tool outputs into context
- Focus on patterns and insights across multiple stories
Return a JSON object matching the DeepResearchReport schema.
"""
async def run_research_async(question: str, max_stories_to_summarize: int = 8) -> DeepResearchReport:
    """Run deep research on a Hacker News topic."""
    mcp_url = os.getenv("HN_MCP_URL", "http://localhost:8080/mcp")
    # Create MCP connection
    mcp_server = MCPServerStreamableHTTP(url=mcp_url)
    # Create agent with structured output
    agent = Agent(
        "openai:gpt-5",
        system_prompt=SYSTEM_PROMPT,
        toolsets=[mcp_server],
        output_type=DeepResearchReport,
        output_retries=2
    )
    # Build user prompt
    user_prompt = f"""Research question: {question}
Please investigate this topic across Hacker News stories and discussions.
Budget: max 5 searches, summarize up to {max_stories_to_summarize} stories.
Focus on finding diverse perspectives and recurring themes."""
    result = await agent.run(user_prompt)
    return result.output
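The search and summary budgets above are stated only in the prompts, so the model is asked to respect them rather than forced to. If a hard cap is wanted, one option is a small counter that tool wrappers consult before each call. The sketch below is a hypothetical add-on, not part of the example project; `ToolBudget` and `BudgetExceeded` are made-up names and nothing here touches the PydanticAI or MCP APIs.

```python
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when a tool is called more often than its budget allows."""


class ToolBudget:
    """Track how many calls remain for each budgeted tool."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._remaining = dict(limits)

    def charge(self, tool: str) -> None:
        """Record one call to `tool`, raising once its budget is spent."""
        if tool not in self._remaining:
            return  # no limit configured for this tool
        if self._remaining[tool] <= 0:
            raise BudgetExceeded(f"no calls left for {tool}")
        self._remaining[tool] -= 1

    def remaining(self, tool: str) -> Optional[int]:
        """Return the calls left for `tool`, or None if it is unlimited."""
        return self._remaining.get(tool)


# Mirrors the limits named in the prompts: 5 searches, 8 summaries.
budget = ToolBudget({"search_stories": 5, "summarize_story": 8})
budget.charge("search_stories")
```

Raising an exception, rather than silently skipping the call, makes an exhausted budget visible to whatever code wraps the tool.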
The MCP server that exposes the tools is simple:
from fenic.api.mcp.server import create_mcp_server, run_mcp_server_sync
from hn_agent.session import get_session
from hn_agent.tools.tools import register_tools
def start_server(port: int = 8080) -> None:
    """Start the HTTP MCP server."""
    session = get_session()
    # Register tools first with the same session
    register_tools(session=session)
    # Get all tools from catalog
    catalog = session.catalog
    tools = catalog.list_tools()
    # Create the MCP server with tools
    server = create_mcp_server(
        session=session,
        server_name="hn_agent",
        tools=tools
    )
    # Run the server with HTTP transport
    print(f"Starting MCP server on http://localhost:{port}")
    run_mcp_server_sync(
        server=server,
        transport="http",
        port=port
    )
To run everything:
# Terminal 1: Start MCP server
uv run python -m hn_agent.mcp.server
# Terminal 2: Run research queries
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli "What are concerns about AI safety?"
# With options
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli --max-stories 10 "Latest LLM developments"
# Output as JSON
OPENAI_API_KEY=$OPENAI_API_KEY uv run python -m hn_agent.cli --json "Rust vs Go discussions"
Whatever the question, the pattern is the same:
Dataframe slice → semantic transforms → agent consumes the results.
8. Where to go next
Everything here is just one concrete instantiation of a more general pattern:
- Swap in your own datasets
  - Company forum threads
  - Support tickets
  - Slack exports
  - Internal RFCs / design docs
- Reuse the same fenic primitives
  - Filter/slice on metadata (teams, product areas, time windows).
  - Use semantic.map with Pydantic models for structured extraction.
  - Use semantic.extract for pulling typed data from text.
  - Use fc.tool_param to create parameterized MCP tools.
- Combine with other fenic examples
  - Use the semantic join examples to correlate HN threads with logs, incidents, or docs.
  - Use the clustering capabilities to group similar discussions together.
  - Use the Hugging Face Datasets integration to hydrate other versioned datasets into DataFrames with one line.
If you want to dig deeper:
- fenic core: https://github.com/typedef-ai/fenic
- This HN agent example: https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent
- Hacker News dataset on HF: https://huggingface.co/datasets/typedef-ai/hacker-news-dataset
I'd love to hear how you adapt this pattern, whether it's for HN, your company's internal knowledge, or other messy discussion data.