<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ian Taylor</title>
    <description>The latest articles on DEV Community by Ian Taylor (@ianbuildsagents).</description>
    <link>https://dev.to/ianbuildsagents</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3982239%2Ffd5a8715-d3ae-476e-9d6a-c1353255252e.png</url>
      <title>DEV Community: Ian Taylor</title>
      <link>https://dev.to/ianbuildsagents</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ianbuildsagents"/>
    <language>en</language>
    <item>
      <title>How I Built an AI Journalist Discovery Engine with Octoparse MCP</title>
      <dc:creator>Ian Taylor</dc:creator>
      <pubDate>Sat, 13 Jun 2026 07:16:14 +0000</pubDate>
      <link>https://dev.to/ianbuildsagents/how-i-built-an-ai-journalist-discovery-engine-with-octoparse-mcp-3im</link>
      <guid>https://dev.to/ianbuildsagents/how-i-built-an-ai-journalist-discovery-engine-with-octoparse-mcp-3im</guid>
      <description>&lt;p&gt;Most people connect Octoparse MCP to their AI assistant and use it to extract a product list or pull some prices into a table.&lt;/p&gt;

&lt;p&gt;That's fine. But I wanted to use it differently.&lt;/p&gt;

&lt;p&gt;I wanted Octoparse MCP to act as a live structured intelligence agent — something my AI system could call on demand, in real time, every time a user submits a story or a press angle. Not a batch job. Not a scheduled pipeline. A live tool call that returns clean, structured journalist profiles directly into an LLM scoring engine.&lt;/p&gt;

&lt;p&gt;That's what I built with E_MediaScience.&lt;/p&gt;

&lt;p&gt;The Problem&lt;br&gt;
PR intelligence tools like Cision and Muck Rack cost $10,000–$30,000 per year. They're inaccessible to most founders, startups, and SMEs — the exact people who most need earned media coverage to grow.&lt;/p&gt;

&lt;p&gt;The alternative is hours of manual research: scanning publication mastheads, reading journalist bylines, guessing at beats and tone. Even then, the outreach is generic because there's no structured data behind it.&lt;/p&gt;

&lt;p&gt;The core problem isn't finding journalists. It's vocabulary asymmetry.&lt;/p&gt;

&lt;p&gt;A founder knows their product. They don't know how a journalist at TechCrunch would classify it, which beat editor covers their category, or which publications have recently covered adjacent topics. Traditional search tools enforce a tight validation loop (keep refining your query until you find something), and most users give up before they get there.&lt;/p&gt;

&lt;p&gt;What E_MediaScience Does&lt;br&gt;
E_MediaScience is a multi-tenant earned media operating system. A user submits a story, launch, or campaign brief. The system:&lt;/p&gt;

&lt;p&gt;Calls Octoparse MCP with intent-based parameters — not a URL, but a topic and geographic target&lt;/p&gt;

&lt;p&gt;Octoparse selects the appropriate journalist discovery template from its 600+ library and executes geo-routed extraction&lt;/p&gt;

&lt;p&gt;Returns clean, structured journalist profiles: name, outlet, beat, article history, tone markers, contact data&lt;/p&gt;

&lt;p&gt;Feeds that payload directly into Claude for AI newsworthiness scoring and journalist matching&lt;/p&gt;

&lt;p&gt;Generates personalised outreach referencing each journalist's actual recent work&lt;/p&gt;

&lt;p&gt;Tracks replies, open rates, and campaign strike rate&lt;/p&gt;

&lt;p&gt;The entire flow takes under 60 seconds from submission to matched journalist list.&lt;/p&gt;
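&lt;p&gt;To make the data flow above concrete, here is a minimal Python sketch of the two shapes involved: the intent-based request and the structured profile that comes back. The class names, field names, and JSON keys are my own illustrative assumptions. They are not the Octoparse MCP schema and not code from the E_MediaScience repo.&lt;/p&gt;

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DiscoveryRequest:
    """Hypothetical intent-based request: a topic and regions, not a URL."""
    topic: str
    regions: list[str]


@dataclass
class JournalistProfile:
    """Fields mirror the profile attributes listed in the post (names assumed)."""
    name: str
    outlet: str
    beat: str
    recent_articles: list[str] = field(default_factory=list)
    tone_markers: list[str] = field(default_factory=list)
    contact: str = ""


def parse_profiles(payload: str) -> list[JournalistProfile]:
    """Turn a JSON array of profile objects into typed records.

    Rows missing name, outlet, or beat are skipped rather than guessed.
    """
    profiles = []
    for row in json.loads(payload):
        if not all(row.get(key) for key in ("name", "outlet", "beat")):
            continue
        profiles.append(
            JournalistProfile(
                name=row["name"],
                outlet=row["outlet"],
                beat=row["beat"],
                recent_articles=list(row.get("recent_articles", [])),
                tone_markers=list(row.get("tone_markers", [])),
                contact=row.get("contact", ""),
            )
        )
    return profiles


# Example request using the topic and regions from the walkthrough in this post.
request = DiscoveryRequest(
    topic="AI video tools, live commerce, creator economy",
    regions=["UK", "US"],
)
print(json.dumps(asdict(request)))
```

&lt;p&gt;Dropping incomplete rows at the boundary is one possible design choice: the scoring step then only ever sees profiles with a name, an outlet, and a beat.&lt;/p&gt;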

&lt;p&gt;Why Octoparse MCP Changes Everything&lt;br&gt;
Before MCP, my options were:&lt;/p&gt;

&lt;p&gt;Build and maintain custom scrapers per publication (brittle, expensive, breaks constantly)&lt;/p&gt;

&lt;p&gt;Use a static journalist database (stale, expensive, no real-time beat tracking)&lt;/p&gt;

&lt;p&gt;Ask an LLM to find journalists (hallucinated profiles, made-up contact details)&lt;/p&gt;

&lt;p&gt;Octoparse MCP eliminates all three problems in a single tool call.&lt;/p&gt;

&lt;p&gt;User submits: "I've launched an AI video clipping tool for live-sellers"&lt;/p&gt;

&lt;p&gt;EMS calls Octoparse MCP:&lt;br&gt;
→ Template: journalist-discovery-tech-ecommerce&lt;br&gt;
→ Parameters: { topic: "AI video tools, live commerce, creator economy", regions: ["UK", "US"] }&lt;/p&gt;

&lt;p&gt;Octoparse returns:&lt;br&gt;
→ 12 journalist profiles, structured JSON&lt;br&gt;
→ Beat: "Commerce technology, live shopping, creator tools"&lt;br&gt;
→ Recent articles, outlet, contact data — all clean, no parsing&lt;/p&gt;

&lt;p&gt;Claude scores:&lt;br&gt;
→ Newsworthiness: 74/100&lt;br&gt;
→ Top match: [Journalist at The Information, beat: AI/Creator Economy]&lt;br&gt;
→ Personalised pitch: references journalist's last 3 articles&lt;/p&gt;

&lt;p&gt;No HTML. No CSS selectors. No fragile extraction logic. The structured payload goes straight into the LLM.&lt;/p&gt;
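&lt;p&gt;As a rough illustration of "straight into the LLM", the sketch below serialises a list of structured profiles into a single scoring prompt. The function name and the prompt wording are hypothetical, and the actual model call is left out. This is a sketch of the idea, not the prompt E_MediaScience uses.&lt;/p&gt;

```python
import json


def build_scoring_prompt(story: str, profiles: list[dict]) -> str:
    """Assemble one prompt string from the story and the structured profiles."""
    lines = [
        "Score the newsworthiness of this story from 0 to 100,",
        "then rank the journalists below by fit.",
        "",
        f"Story: {story}",
        "",
        "Journalists (JSON):",
        # sort_keys keeps the prompt stable across runs for the same input
        json.dumps(profiles, indent=2, sort_keys=True),
    ]
    return "\n".join(lines)


# Placeholder profile for illustration only; not real extracted data.
example_profiles = [
    {"name": "Placeholder Name", "outlet": "Example Outlet", "beat": "Creator tools"},
]
print(build_scoring_prompt("AI video clipping tool for live-sellers", example_profiles))
```

&lt;p&gt;Because the profiles arrive as structured records, the prompt step is plain serialisation with no parsing or cleanup in between.&lt;/p&gt;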

&lt;p&gt;The HungQueryResolver — V1.1 Innovation&lt;br&gt;
The most technically novel part of E_MediaScience is the HungQueryResolver — built specifically around what Octoparse MCP can do when queries fail.&lt;/p&gt;

&lt;p&gt;The problem: clients describe their PR targets in natural language that doesn't map cleanly to journalist taxonomies. "Find people who write about the neat tech stuff I make" is a real query. Traditional systems force clarification loops until the user gives up.&lt;/p&gt;

&lt;p&gt;The HungQueryResolver uses a Three-Strike Escalation architecture:&lt;/p&gt;

&lt;p&gt;Turn 1 — Direct Match&lt;br&gt;
Octoparse MCP is called with the raw query. High-confidence matches are returned immediately.&lt;/p&gt;

&lt;p&gt;Turn 2 — Drift Validation&lt;br&gt;
If confidence falls below the threshold, the user is prompted once for clarification. The system then measures whether the new query actually adds new information or just rephrases the same intent.&lt;/p&gt;

&lt;p&gt;Turn 3 — Async Escalation&lt;br&gt;
If the user is circling the same concept in different words, the system stops asking. A background worker fires a broadened Octoparse MCP call with expanded terminology, adjacent industry classifications, and alternate journalist taxonomies — silently, while the UI holds.&lt;/p&gt;

&lt;p&gt;Instead of a dead end, the user gets a scored set of alternative matches with a transparent quality rating explaining why each result was surfaced.&lt;/p&gt;

&lt;p&gt;This turns a search failure into a consulting asset.&lt;/p&gt;
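&lt;p&gt;The three turns above can be sketched as a small control flow. This is a simplified, synchronous Python sketch under several assumptions of mine: the threshold values are invented, the drift check is a plain word-overlap measure standing in for whatever the real resolver uses, and &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;broaden&lt;/code&gt; are hypothetical callables rather than Octoparse MCP functions. The post does not say what happens when a clarification adds information but still scores low, so this sketch escalates in that case too.&lt;/p&gt;

```python
from typing import Callable

# A search callable returns (results, confidence between 0 and 1). Hypothetical.
SearchFn = Callable[[str], tuple[list[str], float]]

CONFIDENCE_THRESHOLD = 0.7  # assumed value, not from the post
NOVELTY_THRESHOLD = 0.5     # assumed value, not from the post


def novelty(old_query: str, new_query: str) -> float:
    """Fraction of words in new_query that did not appear in old_query."""
    old_words = set(old_query.lower().split())
    new_words = set(new_query.lower().split())
    if not new_words:
        return 0.0
    return len(new_words - old_words) / len(new_words)


def resolve(
    query: str,
    clarification: str,
    search: SearchFn,
    broaden: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Return (stage, results) where stage names the turn that produced them."""
    # Turn 1: direct match on the raw query.
    results, confidence = search(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return "direct", results

    # Turn 2: use the single clarification only if it adds new information.
    if novelty(query, clarification) >= NOVELTY_THRESHOLD:
        results, confidence = search(clarification)
        if confidence >= CONFIDENCE_THRESHOLD:
            return "clarified", results

    # Turn 3: stop asking and run one broadened search instead.
    # In the post this runs in a background worker; here it runs inline.
    results, _ = search(broaden(query + " " + clarification))
    return "escalated", results
```

&lt;p&gt;Returning the stage alongside the results is one way to support the quality rating described above, since the caller can tell the user which turn produced each match.&lt;/p&gt;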

&lt;p&gt;The Multi-Tool MCP Stack&lt;br&gt;
E_MediaScience was built in Cursor IDE using Claude Sonnet and Opus. The full MCP stack:&lt;/p&gt;

&lt;p&gt;Octoparse MCP — Structured journalist extraction (primary data source)&lt;/p&gt;

&lt;p&gt;Supabase MCP — Schema management, RLS policies, Edge Function deployment&lt;/p&gt;

&lt;p&gt;GitHub MCP — Automated commits across the GlafyCo org&lt;/p&gt;

&lt;p&gt;GlobalProxyManager — Custom geo-routing layer for multi-region journalist discovery across 100+ geographic IPs&lt;/p&gt;

&lt;p&gt;The combination of Octoparse (extraction) + Claude (reasoning) + Supabase (persistence) creates a closed-loop intelligence system where every journalist match is grounded in real, live web data.&lt;/p&gt;

&lt;p&gt;Production Philosophy — Pipeline Not Repository&lt;br&gt;
One design decision worth sharing: Octoparse MCP is never used as a data warehouse.&lt;/p&gt;

&lt;p&gt;Every extraction is immediately scored, matched, and actioned. Data follows a strict TTL policy:&lt;/p&gt;

&lt;p&gt;Days 1–30: Hot storage — full access, edit, download&lt;/p&gt;

&lt;p&gt;Days 31–60: Cold storage — read-only, raw source stripped&lt;/p&gt;

&lt;p&gt;Day 61: Hard delete&lt;/p&gt;

&lt;p&gt;This keeps infrastructure lean and reinforces the product positioning: E_MediaScience is a processing engine, not a data repository. Users ingest, score, pitch, and clear the decks.&lt;/p&gt;
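&lt;p&gt;The TTL policy above reduces to a small date calculation. The Python sketch below assumes the extraction day counts as day 1, which the policy does not state explicitly, and the function and tier names are hypothetical.&lt;/p&gt;

```python
from datetime import date, timedelta


def storage_tier(extracted_on: date, today: date) -> str:
    """Map a record's age to a tier; the extraction day counts as day 1."""
    day_number = (today - extracted_on).days + 1
    if day_number >= 61:
        return "deleted"  # day 61 onward: hard delete
    if day_number >= 31:
        return "cold"     # days 31 to 60: read-only, raw source stripped
    return "hot"          # days 1 to 30: full access


# Print the tier at each boundary of the policy.
start = date(2026, 1, 1)
for offset in (0, 29, 30, 59, 60):
    print(offset + 1, storage_tier(start, start + timedelta(days=offset)))
```

&lt;p&gt;Deriving the tier from the extraction date on read means no stored status field has to be kept in sync with the calendar.&lt;/p&gt;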

&lt;p&gt;Pricing Model&lt;br&gt;
E_MediaScience uses a Core + Engines modular pricing architecture:&lt;/p&gt;

&lt;p&gt;Core Platform — $69/month (dashboard, 2 seats, campaign management)&lt;/p&gt;

&lt;p&gt;Signal Engine bolt-on (EMS) — from $29/month&lt;/p&gt;

&lt;p&gt;Production Engine bolt-on (Clipositing video engine) — from $29/month&lt;/p&gt;

&lt;p&gt;Agency tiers — from $799/month with HighLevel CRM integration&lt;/p&gt;

&lt;p&gt;No credits. No per-minute charges. No "credit hostage-taking." Flat session-based pricing that scales by tier, not by usage clock.&lt;/p&gt;

&lt;p&gt;Repo&lt;br&gt;
Everything is open and committed:&lt;/p&gt;

&lt;p&gt;GitHub: github.com/GlafyCo/E_MediaScience&lt;/p&gt;

&lt;p&gt;The architecture docs, sprint plans, and HungQueryResolver spec are all in docs/strategy/. The multi-tenant core, AI scoring engine, and Supabase migrations are all there.&lt;/p&gt;

&lt;p&gt;Built with Octoparse MCP + Cursor + Claude for the Octoparse MCP Challenge 2026.&lt;/p&gt;

&lt;p&gt;Ian Taylor — Founder, GlafyCo | Wales, UK&lt;br&gt;
Building E_MediaScience, Clipositing, and the GlafyCo AI platform stack | X: &lt;a class="mentioned-user" href="https://dev.to/ianbuildsagents"&gt;@ianbuildsagents&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
