<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Hnyk</title>
    <description>The latest articles on DEV Community by Daniel Hnyk (@hnykda).</description>
    <link>https://dev.to/hnykda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3779755%2Fce8849eb-5dc8-43b3-9d66-a183ca61dffb.jpeg</url>
      <title>DEV Community: Daniel Hnyk</title>
      <link>https://dev.to/hnykda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hnykda"/>
    <language>en</language>
    <item>
      <title>The Self-Optimizing SEO Pipeline</title>
      <dc:creator>Daniel Hnyk</dc:creator>
      <pubDate>Fri, 20 Mar 2026 19:07:58 +0000</pubDate>
      <link>https://dev.to/hnykda/the-self-optimizing-seo-pipeline-2jfm</link>
      <guid>https://dev.to/hnykda/the-self-optimizing-seo-pipeline-2jfm</guid>
      <description>&lt;p&gt;&lt;em&gt;These posts are somewhere between a case study and a forkable example. We open-sourced the skills, agents, and Python utilities at &lt;a href="https://github.com/futuresearch/example-cc-cronjob" rel="noopener noreferrer"&gt;github.com/futuresearch/example-cc-cronjob&lt;/a&gt; - they won't work as-is (you'll need your own API keys and sources), but they show all the important bits we use in production. We build &lt;a href="https://futuresearch.ai" rel="noopener noreferrer"&gt;FutureSearch&lt;/a&gt; - forecast, score, classify, or research every row of a dataset - and these pipelines are how we market it.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (March 2026):&lt;/strong&gt; When this post was written, we had two separate domains — futuresearch.ai for research articles and everyrow.io for product pages and docs. We've since consolidated everything onto futuresearch.ai. The pipeline is simpler now: one domain, one GSC property.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;SEO for a small product is a treadmill. We have 75 pages and our top page has 14,757 impressions and 7 clicks - 0.05% CTR. Thousands of people see that listing and scroll past it every week. Figuring out which titles to change, what to change them to, and whether the last change helped or hurt is a spreadsheet job nobody does consistently. But it compounds: a title change that lifts CTR from 0.03% to 0.1% on a 14,000-impression page means 10 more clicks per week.&lt;/p&gt;
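&lt;p&gt;The compounding math is worth making concrete - a two-line helper, using the numbers from this paragraph:&lt;/p&gt;

```python
# Back-of-envelope: what a CTR lift is worth on a high-impression page.
# Numbers are the ones from this post; the helper is plain arithmetic.

def weekly_click_gain(impressions, ctr_before, ctr_after):
    """Extra clicks per week from a CTR change (CTRs as fractions)."""
    return impressions * (ctr_after - ctr_before)

gain = weekly_click_gain(14_000, 0.0003, 0.001)  # 0.03% to 0.1%
print(round(gain))  # 10
```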

&lt;p&gt;The &lt;a href="https://futuresearch.ai/blog/marketing-pipeline-using-claude-code" rel="noopener noreferrer"&gt;marketing pipeline from Post 3&lt;/a&gt; scans communities for people with data problems. This pipeline does something narrower: it reads our own search data and proposes changes to improve what we already have. It reads a week of Google Search Console data, spawns an Opus-model agent for every page, and proposes title and description changes. Each agent reads the history of every change we've made to that page, what the search data looked like before and after, and whether the outcome improved. The next suggestion comes from that history, and it gets better over time.&lt;/p&gt;

&lt;h2&gt;The Pipeline&lt;/h2&gt;

&lt;p&gt;Five phases. 330 lines of markdown, running on the infrastructure from &lt;a href="https://futuresearch.ai/blog/claude-code-kubernetes-cronjob" rel="noopener noreferrer"&gt;Post 1&lt;/a&gt; using the workflow patterns from &lt;a href="https://futuresearch.ai/blog/claude-code-workflow-engine" rel="noopener noreferrer"&gt;Post 2&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: Collect GSC Data
  └── MCP server fetches from Google Search Console (both domains)
       ↓ 6 API calls → raw JSON on disk
Phase 2: Prepare Per-Page Inputs
  └── Python script computes deltas, matches queries to pages
       ↓ 75 per-page JSON files
Phase 3: Analyze All Pages
  └── seo-page-analyzer agents (batches of 10) + seo-new-page-proposer
       ↓ each agent writes suggestion back to its input file
Phase 4: Record Proposed Changes
  └── Collect all suggestions into changes JSON
       ↓
Phase 5: Report + PR
  └── Markdown report with performance table + proposed changes
       ↓ branch, commit, push, PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;How It Collects Data&lt;/h2&gt;

&lt;p&gt;The pipeline reads from Google Search Console via an MCP server - &lt;a href="https://github.com/AminForou/mcp-gsc" rel="noopener noreferrer"&gt;mcp-server-gsc&lt;/a&gt;. One-time setup: a &lt;code&gt;.mcp.json&lt;/code&gt; in the project root (the credentials file mounts as a Kubernetes secret):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"google-search-console"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp-server-gsc"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_APPLICATION_CREDENTIALS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./gsc-credentials.json"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code discovers the tool automatically. The skill file says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;mcp__google-search-console__search_analytics:
  siteUrl: "sc-domain:futuresearch.ai"
  startDate: "{start}"
  endDate: "{end}"
  dimensions: "query,page"
  rowLimit: 25000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six API calls total - page performance, query-page mappings, and all queries, for each domain. Raw JSON lands on disk.&lt;/p&gt;
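&lt;p&gt;The call plan is easy to sketch in code. The helper and output paths below are ours for illustration, not from the repo:&lt;/p&gt;

```python
# Sketch of the Phase 1 collection plan: three report types for each
# of the two domains, six calls total, each landing as raw JSON on
# disk. Names and paths are hypothetical.

DOMAINS = ["futuresearch.ai", "everyrow.io"]
REPORTS = {
    "pages": ["page"],                 # per-page performance
    "query_pages": ["query", "page"],  # query-to-page mappings
    "queries": ["query"],              # all queries
}

def build_call_plan(start, end):
    plan = []
    for domain in DOMAINS:
        for name, dims in REPORTS.items():
            plan.append({
                "siteUrl": f"sc-domain:{domain}",
                "startDate": start,
                "endDate": end,
                "dimensions": dims,
                "rowLimit": 25000,
                "out": f"data/raw/{domain}/{name}.json",
            })
    return plan

print(len(build_call_plan("2026-03-09", "2026-03-15")))  # 6
```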

&lt;h2&gt;How It Decides What to Change&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;lib/seo_prepare.py&lt;/code&gt; transforms the raw GSC data into per-page input files. Each file has everything an agent needs to make a judgment call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-revenue-forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"futuresearch.ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"research"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OpenAI's Financial Forecast 2025-2027"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gsc_current"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clicks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"impressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ctr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"queries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai revenue 2026"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"impressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;716&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gsc_diff"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"clicks_delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"impressions_delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2961&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"experiment_history"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://futuresearch.ai/blog/claude-code-workflow-engine" rel="noopener noreferrer"&gt;lib + agent pattern&lt;/a&gt; from Post 2: Python handles the mechanical work (parsing JSON, computing deltas, matching queries to pages), and the agent handles the judgment (is this title working? did last week's experiment improve CTR?).&lt;/p&gt;
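&lt;p&gt;A minimal sketch of the mechanical half - the helper name is an assumption; the real &lt;code&gt;lib/seo_prepare.py&lt;/code&gt; is in the example repo:&lt;/p&gt;

```python
# Compute week-over-week deltas and attach the queries that hit this
# page. Helper name is hypothetical; the shape mirrors the per-page
# JSON shown above.

def prepare_page_input(slug, current, previous, query_rows):
    """Build one per-page input dict from raw GSC rows."""
    return {
        "slug": slug,
        "gsc_current": current,
        "gsc_diff": {
            "clicks_delta": current["clicks"] - previous["clicks"],
            "impressions_delta": current["impressions"] - previous["impressions"],
        },
        # rows came from the query,page report; keep this page's queries
        "queries": [r for r in query_rows if r["page"].endswith(slug)],
    }

page = prepare_page_input(
    "openai-revenue-forecast",
    {"clicks": 5, "impressions": 14480},
    {"clicks": 0, "impressions": 11519},
    [{"page": "https://futuresearch.ai/openai-revenue-forecast",
      "query": "openai revenue 2026", "impressions": 716}],
)
print(page["gsc_diff"])  # {'clicks_delta': 5, 'impressions_delta': 2961}
```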

&lt;p&gt;The skill runs agents in batches of 10. Each &lt;code&gt;seo-page-analyzer&lt;/code&gt; - running Opus, because judgment matters here - gets one page and makes one decision: suggest a title change, a description change, a content change, or nothing. Eight batches cover all pages. A separate &lt;code&gt;seo-new-page-proposer&lt;/code&gt; reads unmatched queries and flags gaps where we're missing traffic entirely.&lt;/p&gt;
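&lt;p&gt;The batching itself is the boring part - one helper covers it:&lt;/p&gt;

```python
# Chunk the per-page input files into groups of 10; each group becomes
# one round of parallel seo-page-analyzer agents.

def batches(items, size=10):
    return [items[i:i + size] for i in range(0, len(items), size)]

pages = [f"page-{n}.json" for n in range(75)]
print(len(batches(pages)))  # 8
```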

&lt;p&gt;The agents follow a decision framework in the agent definition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product pages (&lt;a href="https://futuresearch.ai/docs/reference/DEDUPE" rel="noopener noreferrer"&gt;Dedupe&lt;/a&gt;, &lt;a href="https://futuresearch.ai/docs/reference/MERGE" rel="noopener noreferrer"&gt;Merge&lt;/a&gt;, &lt;a href="https://futuresearch.ai/docs/reference/RANK" rel="noopener noreferrer"&gt;Rank&lt;/a&gt;, &lt;a href="https://futuresearch.ai/docs/reference/SCREEN" rel="noopener noreferrer"&gt;Screen&lt;/a&gt;) always get experiments, even at zero impressions. Low traffic is a reason to experiment.&lt;/li&gt;
&lt;li&gt;Research pages with CTR above 2% and good position get left alone unless the top queries clearly don't match the title.&lt;/li&gt;
&lt;li&gt;Title formats rotate - question, how-to, keyword-colon-descriptor, direct imperative - so the site doesn't turn formulaic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On one run, the same pipeline proposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How to Search Government Websites at Scale, for Investors" → "Which Texas Cities Have the Fastest Permit Approval Times?" - question format, specific geography&lt;/li&gt;
&lt;li&gt;"Using LLMs for Data Cleaning At Scale" → "LLM Deduplication at 20,000 Rows: F1=0.996 for $1.12 per 1k Rows" - specific numbers for a developer audience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output of a single run is a PR. Two real excerpts from the March 18th report - one routine, one where the history caught a mistake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**forecasting-top-ai-lab-2026**&lt;/span&gt; - description
&lt;span class="p"&gt;-&lt;/span&gt; Was: (empty)
&lt;span class="p"&gt;-&lt;/span&gt; Proposed: "We ranked OpenAI, Anthropic, Google DeepMind, xAI, and Meta across
  model quality, data, compute, talent, and R&amp;amp;D automation. See who is winning
  the AI race in 2026 and where each lab stands heading into Q2."
&lt;span class="p"&gt;-&lt;/span&gt; Why: 14,757 impressions, 7 clicks (0.05% CTR) despite ranking position 1-5 for
  many queries. Description is empty - Google is writing its own snippet. Adding
  a concrete description is the lowest-effort lever left on this page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**lead-scoring-without-crm**&lt;/span&gt; - title
&lt;span class="p"&gt;-&lt;/span&gt; Was: "How to Score Leads with AI When You Don't Have a CRM"
&lt;span class="p"&gt;-&lt;/span&gt; Proposed: "AI Lead Scoring Without Clay: Rank 500 Prospects for $28"
&lt;span class="p"&gt;-&lt;/span&gt; Why: Previous experiment removed 'Clay' from the title. Result: clay lead
  scoring impressions dropped from 39 to 1, all Clay-related queries lost.
  History shows this was a clear regression. Adding it back with specific
  numbers targets the audience that was converting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing gets applied automatically. A human reviews the proposals, picks the ones worth trying, and applies them. The whole review takes about 20 minutes.&lt;/p&gt;

&lt;h2&gt;How It Gets Better&lt;/h2&gt;

&lt;p&gt;Every page's input file includes &lt;code&gt;experiment_history&lt;/code&gt; - every change we've made, when we made it, the search data before and after, and whether the outcome improved, stayed flat, or regressed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"experiment_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"change_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"old_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OpenAI Revenue Report"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"new_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OpenAI's Revenue in 2027: A Comprehensive Forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data_before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clicks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"impressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ctr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;8.2&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data_after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clicks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"impressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22039&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ctr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"improved"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analyzer reads this before suggesting the next change. A title that improved CTR informs the next experiment. One that regressed is a "don't repeat this" marker. It's closer to a consultant who keeps notes than anything resembling ML. The JSON file is the notebook. Each run reads it before writing in it.&lt;/p&gt;
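&lt;p&gt;One plausible way to derive the &lt;code&gt;outcome&lt;/code&gt; field - the real rule may use a noise band or more signals; this is just the shape, run on the numbers above:&lt;/p&gt;

```python
# Hypothetical labeling rule for experiment_history entries: classify
# an experiment by the direction of its CTR movement.

def label_outcome(before, after):
    """Return 'improved', 'flat', or 'regressed' from CTR movement."""
    change = round(after["ctr"] - before["ctr"], 4)
    if change == 0:
        return "flat"
    return "improved" if change == abs(change) else "regressed"

before = {"clicks": 5, "impressions": 18000, "ctr": 0.03, "position": 8.2}
after = {"clicks": 10, "impressions": 22039, "ctr": 0.05, "position": 7.5}
print(label_outcome(before, after))  # improved
```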

&lt;p&gt;The agents don't share history across pages. The learning is per-page: what was tried, what happened, what to try next. After six runs across two months, some patterns are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question-format titles outperform statement titles for research articles&lt;/li&gt;
&lt;li&gt;Specific numbers in case study titles ("F1=0.996 for $1.12 per 1k Rows") lift CTR on developer-focused pages&lt;/li&gt;
&lt;li&gt;Empty descriptions on high-impression pages are a recurring catch - our top page ran for weeks with no meta description while Google wrote one for us&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Where It Stands&lt;/h2&gt;

&lt;p&gt;Pages analyzed grew from 35 to 80 over the first few runs. From the March 18th run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80 pages tracked&lt;/li&gt;
&lt;li&gt;14,757 impressions on our top page (forecasting-top-ai-lab-2026)&lt;/li&gt;
&lt;li&gt;69 changes proposed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The docs pages are still early. The &lt;a href="https://futuresearch.ai/docs/reference/DEDUPE" rel="noopener noreferrer"&gt;Dedupe&lt;/a&gt; reference page has 12 impressions. The &lt;a href="https://futuresearch.ai/docs/reference/MERGE" rel="noopener noreferrer"&gt;Merge&lt;/a&gt; reference page has 0. The pipeline tracks them alongside the 14,757-impression research articles, but applies different rules: always experiment on product pages, leave well-performing research pages alone. We're building product page SEO while the research articles carry traffic.&lt;/p&gt;

&lt;p&gt;A non-technical person on the team opens the PR, reads through the proposed changes, and applies the ones that make sense. The pipeline produces 69 suggestions with reasoning and data. The human spends 20 minutes deciding which ones to run. Neither does this alone - the human wouldn't compute deltas across 80 pages every week, and the pipeline doesn't get to change titles on a 14,000-impression page without someone reviewing it first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://futuresearch.ai/blog/pipeline-uses-its-own-product" rel="noopener noreferrer"&gt;An LLM Pipeline That Uses Its Own Product&lt;/a&gt; - the pipeline that finds today's news, calls our own product, and generates sardonic data visualizations about Microsoft Copilot's dignity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We build &lt;a href="https://futuresearch.ai" rel="noopener noreferrer"&gt;FutureSearch&lt;/a&gt; - forecast, score, classify, or research every row of a dataset. This pipeline is how we optimize its SEO.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://futuresearch.ai/" rel="noopener noreferrer"&gt;FutureSearch&lt;/a&gt; lets you run your own team of AI researchers and forecasters on any dataset. &lt;a href="https://futuresearch.ai" rel="noopener noreferrer"&gt;Try it for yourself.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Marketing Pipeline Using Claude Code</title>
      <dc:creator>Daniel Hnyk</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:06:53 +0000</pubDate>
      <link>https://dev.to/hnykda/marketing-pipeline-using-claude-code-3ll8</link>
      <guid>https://dev.to/hnykda/marketing-pipeline-using-claude-code-3ll8</guid>
      <description>&lt;p&gt;&lt;em&gt;These posts are somewhere between a case study and a forkable example. We open-sourced the skills, agents, and Python utilities at &lt;a href="https://github.com/futuresearch/example-cc-cronjob" rel="noopener noreferrer"&gt;github.com/futuresearch/example-cc-cronjob&lt;/a&gt; - they won't work as-is (you'll need your own API keys and sources), but they show all the important bits we use in production. We build &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt; - forecast, score, classify, or research every row of a dataset - and these pipelines are how we market it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;People who need &lt;a href="https://futuresearch.ai" rel="noopener noreferrer"&gt;futuresearch.ai&lt;/a&gt; are out there - scattered across Reddit, StackOverflow, HubSpot forums, Salesforce communities, Make.com, Airtable, Shopify, GitHub, and a dozen others. Someone deduplicating a CRM where "IBM" and "International Business Machines" are the same company. Someone joining two tables that share no common key. Someone ranking leads by criteria a spreadsheet formula can't express. We narrowed it down to 18 sources where these conversations happen most often. The problem is that maybe 2-3% of posts are actually relevant. Manually scanning hundreds of posts every morning to find two or three good ones is not something a human is going to keep doing.&lt;/p&gt;

&lt;p&gt;So we built a pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: Scan
  └── 18 Python scanners fetch posts from Reddit, StackOverflow, HubSpot, ...
       ↓ dedup against seen.txt
Phase 2: Enrich
  └── Fetch full thread content: comments, replies, author info, vote counts
       ↓
Phase 3: Classify
  └── 13-question rubric per thread, assign score 1-5
       ↓ filter to score 4-5
Phase 4: Propose
  └── Select strategy, match to demo catalog, draft forum response
       ↓
Phase 5: Report + PR
  └── Markdown report with metrics, draft responses. Branch, commit, push, PR.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every weekday at 08:00 UTC, a CronJob runs this end-to-end, unattended, in about 14 minutes. The output is a pull request someone on the team opens over coffee. It runs on the infrastructure from &lt;a href="https://futuresearch.ai/blog/claude-code-kubernetes-cronjob" rel="noopener noreferrer"&gt;Post 1&lt;/a&gt;, using the workflow patterns from &lt;a href="https://futuresearch.ai/blog/claude-code-workflow-engine" rel="noopener noreferrer"&gt;Post 2&lt;/a&gt;. This post puts the concepts from those two posts to use.&lt;/p&gt;
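&lt;p&gt;The &lt;code&gt;seen.txt&lt;/code&gt; dedup in Phase 1 is what keeps a daily cron from re-proposing the same threads. A sketch - the file name is from the diagram, the helper name is ours:&lt;/p&gt;

```python
# Skip anything a previous run already processed, then record today's
# finds so tomorrow's run skips them too.
import pathlib

SEEN = pathlib.Path("seen.txt")

def filter_new(posts):
    seen = set(SEEN.read_text().splitlines()) if SEEN.exists() else set()
    fresh = [p for p in posts if p["url"] not in seen]
    with SEEN.open("a") as f:
        f.writelines(p["url"] + "\n" for p in fresh)
    return fresh

posts = [{"url": "https://stackoverflow.com/q/12345"}]
print(len(filter_new(posts)))  # 1 on the first run, 0 on a rerun
```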

&lt;h2&gt;Dealing with Signal vs Noise&lt;/h2&gt;

&lt;p&gt;A typical run from February: 57 opportunities scanned, 35 enriched with full thread content, 35 classified. Score distribution: 1 scored 5, 1 scored 4, 33 scored 1-2. Ninety-four percent of the classified threads are noise. And that's fine - those two good ones are what the whole pipeline exists for.&lt;/p&gt;

&lt;p&gt;The noise is varied and no keyword filter catches it. About 50% of Reddit "opportunities" turn out to be competitor marketing posts dressed up as questions - someone promoting their deduplication tool while pretending to ask for advice. Discussion threads that start with "What's your favorite..." are never opportunities. Platform configuration bugs dressed as data problems - someone's Make.com aggregator is misconfigured; they aren't facing a data quality issue. Career questions on Snowflake forums. "Show HN" builder posts. Exact-match problems where VLOOKUP works fine and the person just hasn't tried it yet.&lt;/p&gt;

&lt;p&gt;So we run our own LLM-powered classifier, which works through a rubric of 13 structured questions per thread. Not all of them are interesting (you can see all of them in the example repo), but these are the ones that carry the most weight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;canonical&lt;/strong&gt;: Is this a common problem others face daily, or bespoke? A canonical problem means a response helps thousands of future readers, not just one person.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tools_tried&lt;/strong&gt;: What have they already tried? If they've tried fuzzy matching and it failed, they already understand why their problem is hard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tried_llms&lt;/strong&gt;: Have they tried ChatGPT for this? If they tried and it didn't work, they need a tool that actually scales.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;importance&lt;/strong&gt;: Does this look important? Business process blocked? "Our admin is drowning" is a different signal than "just curious."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;commenter_solutions&lt;/strong&gt;: What are commenters saying? If someone already solved it with a native platform feature - and the poster accepted the answer - there's no opportunity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;person_importance&lt;/strong&gt;: Does the person look important? A StackOverflow user with 700k reputation answering "there's no solution" makes the thread more visible, not less.&lt;/li&gt;
&lt;/ul&gt;
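&lt;p&gt;The classifier assigns the 1-5 score by judgment, not by formula, but a mechanical baseline shows the shape of the aggregation - the weights below are illustrative, not the real ones:&lt;/p&gt;

```python
# Illustrative only: collapse rubric answers into a 1-5 score. The
# real rubric has 13 questions and the classifier weighs them by
# judgment; True here means the answer favors an opportunity.

WEIGHTS = {
    "canonical": 2.0,
    "tools_tried": 1.5,
    "tried_llms": 1.0,
    "importance": 1.5,
    "commenter_solutions": 2.0,
    "person_importance": 1.0,
}

def score(answers):
    """Map boolean rubric answers to a 1-5 opportunity score."""
    total = sum(w for q, w in WEIGHTS.items() if answers.get(q))
    return max(1, round(5 * total / sum(WEIGHTS.values())))

print(score({"canonical": True, "tools_tried": True,
             "importance": True, "commenter_solutions": True}))  # 4
```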

&lt;p&gt;The classifier's instructions include: "At no point should you Write() a Python script. If you think you need one, it's because you misunderstood these instructions." We added this after a classifier tried to write a sentiment analysis script instead of just reading the thread and thinking about it.&lt;/p&gt;

&lt;h2&gt;Examples: Three Real Finds&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Brazilian cities.&lt;/strong&gt; Someone on StackOverflow was manually fixing about 5,000 Brazilian city name variants with SQL &lt;code&gt;UPDATE&lt;/code&gt; statements. Bill Karwin - one of the highest-reputation answerers on StackOverflow - wrote: "there's no solution to correct 100% of the variations." SOUNDEX fails on Portuguese phonetics. The pattern table approach from another answer still requires manually enumerating every variation.&lt;/p&gt;

&lt;p&gt;The pipeline found this at 8am scanning the &lt;code&gt;record-linkage&lt;/code&gt; tag. The classifier scored it 5. The proposer matched it to demo C11 (Challenging + Messy) and drafted a response showing the &lt;a href="https://github.com/futuresearch/everyrow-sdk" rel="noopener noreferrer"&gt;&lt;code&gt;everyrow&lt;/code&gt; SDK&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dedupe&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cities_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;equivalence_relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Same Brazilian city, accounting for:
        - Accent differences (Florianopolis vs Florianópolis)
        - Abbreviations (Sto Andre vs Santo André, S Jose vs São José)
        - Typos and spacing variations
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;equivalence_relation&lt;/code&gt; is natural language - you describe what counts as a match and the model handles the linguistic reasoning. No regex, no phonetic algorithm, no pattern table. We reviewed the draft, tweaked a sentence, and posted it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Make.com 75K-row CSV.&lt;/strong&gt; A user on Make.com had a 75,000-row CSV and needed both exact AND similar matches. Make.com's AI agent can't handle that scale - it's designed for conversational Q&amp;amp;A, not batch processing. The only commenter suggested exact-match approaches (map/aggregator), which completely miss the semantic similarity requirement. The pipeline classified it as a score-4 opportunity and drafted a response showing how &lt;code&gt;everyrow dedupe&lt;/code&gt; handles the full 75K rows in one pass, with instructions for getting results back into a Make workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Agentforce problem.&lt;/strong&gt; "We bought Agentforce but can't use it because our Salesforce data is a mess." Company names listed 3-4 different ways, contacts missing emails, opportunities linked to wrong accounts. 58 upvotes, 35 comments. This represents a category the pipeline keeps discovering - AI-readiness problems, where companies buy AI tools and find their data isn't ready. The pipeline found it, classified it, and we posted a response showing CRM &lt;a href="https://everyrow.io/docs/reference/DEDUPE" rel="noopener noreferrer"&gt;deduplication&lt;/a&gt; with the SDK: 210 records in, 42 duplicates found, 52 seconds, $0.23.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dedupe&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;crm_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;equivalence_relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two entries are duplicates if they represent &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the same company, accounting for abbreviations, typos, and subsidiaries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 210 rows → 168 unique entities, 42 duplicates identified
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Works and What Doesn't
&lt;/h2&gt;

&lt;p&gt;After two months of daily runs, the source-level data is clear:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reddit&lt;/td&gt;
&lt;td&gt;1.5-3%&lt;/td&gt;
&lt;td&gt;Consistently highest signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;Low volume (1-2/run) but when it hits, it hits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StackExchange&lt;/td&gt;
&lt;td&gt;2-5% on classic tags&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;record-linkage&lt;/code&gt;, &lt;code&gt;string-matching&lt;/code&gt; work. &lt;code&gt;excel&lt;/code&gt;, &lt;code&gt;google-sheets&lt;/code&gt; yield 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Make.com&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Workflow builders who need AI at one step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Salesforce&lt;/td&gt;
&lt;td&gt;Occasional&lt;/td&gt;
&lt;td&gt;High-quality finds when they appear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;n8n&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;132 posts across 7 runs. Zero data problems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retool&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;300+ posts. Platform support only.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We kept scanning n8n for seven consecutive runs hoping something would turn up. Every run found posts about workflow configuration, OAuth setup, and version upgrade bugs. The learnings file eventually said what we already knew: discontinue.&lt;/p&gt;

&lt;p&gt;Cost: $5-8 per run in API usage for our own utilities. A full run takes about fourteen minutes, and the Claude Code usage itself is covered by our $200 Anthropic Max plan.&lt;/p&gt;

&lt;p&gt;The pipeline also surfaces other interesting findings as it analyzes historical questions, such as these market shifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM adoption inflection&lt;/strong&gt;: People who tried LLMs before asking for help went from 6-8% (2020-2023) to 33% in 2025. A third of our prospects have already tried ChatGPT and found it doesn't scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StackOverflow collapse&lt;/strong&gt;: StackOverflow went from 23% of our opportunities in 2020 to 3% in 2025. Reddit grew from 6% to 36%. Technical Q&amp;amp;A has fragmented into product-specific communities - which is exactly why we need 18 scanners instead of one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Response Strategy
&lt;/h2&gt;

&lt;p&gt;For opportunities scoring 4 or 5, product-specific proposer agents take over. Each proposer reads our product docs and a catalog of 29 existing demos, then generates a response using one of these strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PROVE_CAPABILITY&lt;/td&gt;
&lt;td&gt;Default (~80%). Show a demo proving we solve the problem.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SHOW_SDK_CODE&lt;/td&gt;
&lt;td&gt;Technical audience. Lead with a code snippet.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SHOW_INTEGRATION&lt;/td&gt;
&lt;td&gt;Workflow platform users. Show how results fit their pipeline.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXPLAIN_APPROACH&lt;/td&gt;
&lt;td&gt;Audience wants to understand &lt;em&gt;why&lt;/em&gt; LLMs beat fuzzy matching.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OFFER_HANDS_ON&lt;/td&gt;
&lt;td&gt;Recent post, engaged OP. Offer to run their data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The proposer matches each problem to the closest of our 29 existing demos, which are organized by difficulty: it reads the catalog and picks. When the poster provides sample data, it shows results on &lt;em&gt;their&lt;/em&gt; data. When they don't, it shows results on the closest demo we have.&lt;/p&gt;

&lt;p&gt;The test for every draft: if someone stripped the product mention, would this answer still be useful?&lt;/p&gt;

&lt;p&gt;This is where the loop closes. As we described in &lt;a href="https://dev.to/blog/claude-code-workflow-engine"&gt;Post 2&lt;/a&gt;, the output of the whole system is a pull request. A non-technical person on the team opens it, reads the report, and sees the draft responses with working code snippets and real results. They adjust the tone, maybe add a sentence from their own experience, and post it. The person on the other end gets a genuinely helpful answer to a problem they were stuck on. That's the point - not to pollute forums with product links, but to find people who are actually struggling with something our tools solve and help them. All told, it takes about 15 minutes of human time for what would otherwise be a full day of research.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline Teaches Itself
&lt;/h2&gt;

&lt;p&gt;After each run, the pipeline can update a learnings file. These aren't logs - they're instructions for future runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- "Remove 'duplication' tag - returns feature posts, not data problems"
- "Databricks: low volume but 40% conversion. Worth keeping."
- "If native platform feature exists and author accepts it → score 1-2"
- "Christmas Eve: 50% false positives. Likely holiday effect."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next run reads the learnings before it starts. Over 6 weeks: 642 proposals in the database, 3,800+ URLs processed. The pipeline gets better because it remembers what didn't work.&lt;/p&gt;
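&lt;p&gt;To make that concrete, here is a minimal sketch of what "reads the learnings before it starts" can look like. The file path and helper names are our own, not the pipeline's:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

def load_learnings(path="data/learnings.md"):
    """Collect the bullet-point instructions accumulated by past runs."""
    p = Path(path)
    if not p.exists():
        return []
    lines = p.read_text().splitlines()
    return [line.strip() for line in lines if line.strip().startswith("- ")]

def build_prompt(base_prompt, learnings):
    """Prepend prior learnings so a new run starts where the last one left off."""
    if not learnings:
        return base_prompt
    return "Learnings from previous runs:\n" + "\n".join(learnings) + "\n\n" + base_prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything actionable in the file is a bullet, so a dumb line filter is enough - the model does the interpreting.&lt;/p&gt;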

&lt;p&gt;The best finds aren't always in new threads. Thread archaeology - checking old discussions for unanswered or poorly answered questions - turned up some of the strongest opportunities. The Agentforce post was months old when the pipeline found it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simplified Example
&lt;/h2&gt;

&lt;p&gt;We put together a runnable version of this pipeline at &lt;a href="https://github.com/futuresearch/example-cc-cronjob" rel="noopener noreferrer"&gt;github.com/futuresearch/example-cc-cronjob&lt;/a&gt; - the same repo from Post 1, now with a &lt;code&gt;community-scanner&lt;/code&gt; skill alongside the original &lt;code&gt;add-numbers&lt;/code&gt; example. It has the full structure: a skill with all five phases, a classifier agent with the 13-question rubric, a proposer agent with the strategy taxonomy and SDK examples, a Python scanner that fetches from Reddit's public JSON API, and a learnings file the pipeline updates after each run. It scans a few subreddits instead of 18 sources, and runs in a single process instead of fanning out to parallel subagents, but the pipeline logic is the same. Fork it, point it at your subreddits, see what it finds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Know Now
&lt;/h2&gt;

&lt;p&gt;The infrastructure is the easy part. The know-how - which sources to scan, what questions to ask, how to draft a response that genuinely helps someone - is what those daily runs teach you. If this stopped working tomorrow, we'd manually check a few subreddits once a week. Like we did before December.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We build &lt;a href="https://futuresearch.ai" rel="noopener noreferrer"&gt;futuresearch.ai&lt;/a&gt; - forecast, score, classify, or research every row of a dataset. This pipeline is how we find the people who need it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FutureSearch lets you run your own team of AI researchers and forecasters on any dataset. &lt;a href="https://futuresearch.ai/blog" rel="noopener noreferrer"&gt;Try it for yourself.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Using Claude Code as a Workflow Engine</title>
      <dc:creator>Daniel Hnyk</dc:creator>
      <pubDate>Fri, 27 Feb 2026 13:14:01 +0000</pubDate>
      <link>https://dev.to/hnykda/using-claude-code-as-a-workflow-engine-403f</link>
      <guid>https://dev.to/hnykda/using-claude-code-as-a-workflow-engine-403f</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a series on using Claude Code as a production runtime. Originally published on &lt;a href="https://everyrow.io/blog/claude-code-workflow-engine" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Our marketing pipeline scans 18 community sources, enriches threads with full content, classifies opportunities with a 20-question rubric, generates draft forum responses, and creates a pull request - every weekday at 08:00 UTC. The whole pipeline definition is not, say, Python functions wired into a workflow manager and executor like Prefect or Dagster (both of which are cool), but - yeah, you guessed it - a markdown file in plain English, written by my boss.&lt;/p&gt;

&lt;p&gt;I don't mean my boss &lt;em&gt;specified&lt;/em&gt; it and an engineer implemented it. I mean he opened &lt;code&gt;SKILL.md&lt;/code&gt; in his editor and typed the pipeline in English. Or more precisely - in the spirit of this series - he asked Claude Code to write it together with him. It's a markdown file that says things like "spawn 18 scanners in background" and "after phase 1, do phase 2." It's not a formal task DAG and isn't specified in code. And it all runs &lt;em&gt;inside&lt;/em&gt; Claude Code, as described in our &lt;a href="https://everyrow.io/blog/claude-code-kubernetes-cronjob" rel="noopener noreferrer"&gt;first post in the series&lt;/a&gt;. This post is a general comparison of such systems; subsequent posts will dig into specific instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Comparison
&lt;/h2&gt;

&lt;p&gt;We're not going to pretend this is better than Prefect or Dagster. For a lot of workloads, it's worse. But "a lot" isn't "all," and we think the tradeoff space is genuinely interesting. Here is a somewhat naive comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Prefect / Dagster&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task definition&lt;/td&gt;
&lt;td&gt;Python functions, objects, decorators, ...&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAG&lt;/td&gt;
&lt;td&gt;Explicit dependency graph&lt;/td&gt;
&lt;td&gt;"after scanning, enrich"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workers&lt;/td&gt;
&lt;td&gt;Containerized functions&lt;/td&gt;
&lt;td&gt;Subagents with their own context windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry logic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@task(retries=3)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"if Python enrichment fails, try WebFetch instead"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding a new integration&lt;/td&gt;
&lt;td&gt;Install plugin, configure IO manager, write config schema&lt;/td&gt;
&lt;td&gt;"read from BigQuery"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaffolding&lt;/td&gt;
&lt;td&gt;Specific decorators, YAML, &lt;code&gt;definitions.py&lt;/code&gt;, webserver config, user code, ...&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;webserver, usercode containers, UIs, DB, ...&lt;/td&gt;
&lt;td&gt;one (cron)job as per Part 1 of this series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Dashboards, metrics, alerts, orchestration UIs&lt;/td&gt;
&lt;td&gt;none?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who writes it&lt;/td&gt;
&lt;td&gt;Software engineer&lt;/td&gt;
&lt;td&gt;anyone, in English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Stack traces, breakpoints&lt;/td&gt;
&lt;td&gt;Absolutely horrendous&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I am not going to pretend this comparison isn't skewed towards Claude Code - it heavily is, because Claude Code is not a full replacement for these tools. Dagster gives you sensors, queues, concurrency limits, work pools and more, and if you need those, go for it. What we cover here is mostly the job runtime and basic orchestration (which could still be plugged into such frameworks to get the best of both worlds).&lt;/p&gt;

&lt;p&gt;I want to write something like "it's all markdown files", which is a slight exaggeration, but not much of one! The whole setup is one skill (the orchestrator), a handful of subagent definitions, and some Python libraries for the mechanical stuff. Compare that to, say, Dagster scaffolding. Dagster is pretty opinionated and you &lt;em&gt;really&lt;/em&gt; want to do things the way it wants you to - &lt;code&gt;definitions.py&lt;/code&gt;, YAML config, webserver, user code server, and if you want to read from GCS, the right IO manager plugin configured through Dagster's abstraction layer instead of just... asking Claude to use &lt;code&gt;gsutil&lt;/code&gt;. It's all legitimate infrastructure for production workloads. If tomorrow we need to read from BigQuery, we write "query BigQuery for the last 7 days of page analytics" in the skill file and Claude figures out the &lt;code&gt;bq&lt;/code&gt; command or the MCP tool or whatever's available (setting &lt;em&gt;those&lt;/em&gt; up, plus permissions, is still some annoying boilerplate though).&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline is one skill that orchestrates six phases. Most of the heavy lifting is fanned out to subagents running in parallel. We will get into the details of the pipeline in a separate post, but just to give you an idea of what we're talking about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: Scan
  ├── Python search script → produces shards
  ├── 18 scanner subagents (one per source: reddit, hubspot, shopify, ...)
  └── N search-scanner subagents (one per shard)
       ↓ poll filesystem for .json / .error files
Phase 2: Enrich
  └── Python enrichment (fetch full thread content, WebFetch fallback)
       ↓
Phase 3: Classify
  └── N classifier subagents (one per enriched file, 20 questions, score 1-5)
       ↓ poll filesystem again
Phase 4: Propose
  └── proposer-{product} subagents (one per product with score 4-5 hits)
       ↓
Phase 5: Report
  └── markdown report with metrics, top opportunities, draft responses
       ↓
Phase 6: Git
  └── branch, commit, push, open PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator - the main Claude Code process - reads the skill, spawns the subagents via the &lt;code&gt;Task&lt;/code&gt; tool, and coordinates them. The subagents write results to disk. The orchestrator polls for output files rather than collecting agent output directly (we'll get to why in the filesystem section below). The "dependency graph" is just document order: phase 2 comes after phase 1 because it's written after phase 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Accidental Resilience
&lt;/h2&gt;

&lt;p&gt;Here's one aspect that the comparison table doesn't capture: Claude Code is accidentally resilient in ways that traditional orchestrators are not.&lt;/p&gt;

&lt;p&gt;When a Python script hits an unexpected error, it crashes. By default the state is lost; you dig through a logging tool and try to find the bug. If you're lucky, you fix it - but often the failure is hard to reproduce (Reddit blocks IPs from GCP), so you re-run the whole thing and hope. Orchestrators try to help with retry mechanisms, which are good but far from ideal against unknown unknowns: how many retries do you need, what should the backoff period be, which error types are even worth retrying, and so on.&lt;/p&gt;

&lt;p&gt;When Claude Code hits an error, it &lt;em&gt;reads the error message and decides what to do.&lt;/em&gt; A library isn't installed in the container? It runs &lt;code&gt;apt-get install&lt;/code&gt; (scary, but awesome). An API returns an unexpected format? It adapts the parsing. The enrichment script returns fewer results than expected? The pipeline instruction says "use WebFetch for the failed URLs" - and it does, for just the ones that failed, preserving everything that already worked.&lt;/p&gt;

&lt;p&gt;This is not magic. It's just that the "retry logic" has access to the same reasoning that wrote the original attempt. It can distinguish between "the server is down, try again" and "this approach won't work, try a different one," in a way traditional retries cannot.&lt;/p&gt;

&lt;p&gt;And the state preservation is a great feature on its own. When running locally and phase 3 of our pipeline fails, phases 1 and 2's results are still on disk &lt;em&gt;and&lt;/em&gt; in the conversation context. If we're running interactively, we can &lt;code&gt;--resume&lt;/code&gt; and say "phase 2 worked fine, start from phase 3 and here's what went wrong." The agent just remembers everything - no checkpoint files, no serialization, no cache key configuration.&lt;/p&gt;

&lt;p&gt;Prefect and Dagster have caching, and it's a real feature. But getting it right is real engineering work: hash the inputs properly for the cache key, make sure the task-level cache interacts correctly with the flow-level cache, handle the case where a cached task succeeds but the next task fails, decide where the cache is stored... We've been through this, and sometimes it's just not worth the effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Skill Looks Like
&lt;/h2&gt;

&lt;p&gt;This is a real excerpt from our pipeline definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Phase 1: Scan&lt;/span&gt;

&lt;span class="gu"&gt;### Step 1b: Run Domain Scanners&lt;/span&gt;

Spawn all 18 domain scanners in background.
Track each task_id with its source name.

Each: Task (subagent_type: scanner, run_in_background: true): "Scan {source}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's basically the DAG: "Spawn 18 things. Track them." Claude Code reads this, spawns 18 subagents, and tracks the task IDs. The "dependency graph" is the document order: Phase 2 comes after Phase 1 because it's written after Phase 1 - which is how humans naturally work and think anyway.&lt;/p&gt;

&lt;p&gt;Running it is what we covered in &lt;a href="https://dev.to/blog/claude-code-kubernetes-cronjob"&gt;Part 1&lt;/a&gt;. You pass &lt;code&gt;"execute scan-and-classify skill"&lt;/code&gt; as a prompt and it runs. Again, you don't have to think about deployments or flags or whether to use &lt;code&gt;deploy()&lt;/code&gt; or &lt;code&gt;serve()&lt;/code&gt; - it's just a CLI command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an Agent Looks Like
&lt;/h2&gt;

&lt;p&gt;We do have specialized agents that do specific jobs. Agents are spawned with their own context, so the context of the main orchestrating agent doesn't explode. Each subagent is a markdown file with YAML frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scanner&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan a community source for marketing opportunities.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash, Read, Write, Glob, Grep&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonnet&lt;/span&gt;
&lt;span class="na"&gt;permissionMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bypassPermissions&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have 23 of these agent definitions. Scanners, classifiers, proposers, graphics generators, dataset finders, SEO analyzers. Each one is a markdown file describing what the agent should do, what tools it has, and what model to use.&lt;/p&gt;
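&lt;p&gt;Loading such a definition needs no framework - the frontmatter is flat key-value pairs. A dependency-free sketch (our own helper, not Claude Code internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def parse_agent_file(text):
    """Split a subagent markdown file into frontmatter fields and the prompt body.

    Minimal sketch: assumes flat 'key: value' frontmatter delimited by '---',
    which is all these agent definitions use.
    """
    _, frontmatter, body = text.split("---", 2)
    fields = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, body.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;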

&lt;h2&gt;
  
  
  Python Does Mechanics, Claude Does Judgment
&lt;/h2&gt;

&lt;p&gt;One of the design principles is &lt;strong&gt;putting mechanics in code and letting Claude make judgments.&lt;/strong&gt; Specifically, it's this separation of concerns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lib/scanners/reddit.py     → Fetches posts, parses JSON, handles rate limits
.claude/agents/scanner.md  → Reads posts, decides "is this a real data problem?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a strict separation - it's totally fine for the agents to write some code. But anything that can be reused and standardized belongs in the library; it's quite wasteful, resource-wise, to let agents re-derive mechanics like scanning API endpoints on every run.&lt;/p&gt;
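&lt;p&gt;As an illustration of the mechanical half, here's roughly what the scanner library's parsing step could look like. The field names follow Reddit's public JSON API, but the record shape and function name are our own sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def parse_listing(payload):
    """Flatten a Reddit-style JSON listing into plain records.

    Purely mechanical: no judgment here about whether a post is an
    opportunity - that's the scanner agent's job.
    """
    records = []
    for child in payload["data"]["children"]:
        post = child["data"]
        records.append({
            "url": "https://reddit.com" + post["permalink"],
            "title": post["title"],
            "body": post.get("selftext", ""),
            "num_comments": post["num_comments"],
        })
    return records
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;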

&lt;p&gt;This is yet another part where running inside Claude Code shines - it can &lt;strong&gt;develop and improve itself while running in production.&lt;/strong&gt; No, really, let that sink in for a second, because you cannot just gloss over it: development and runtime blend together. You tell it to run the scanner for a site and it tells you it can't because of X, but presents a workaround for X that it can incorporate into the skill or lib for future runs. When it discovers that the environment has changed - say, a remote API's schema is different - it can self-correct at runtime, and even commit that fix as an improvement for all further runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Filesystem Is the Message Bus
&lt;/h2&gt;

&lt;p&gt;Here's where it gets ugly. In Prefect, the orchestrator backend manages dependencies. Claude can of course query the state of its agents natively, and that worked fine for around 4 agents. When we scaled to 18, the orchestrator's context window filled up with all the returned output - it seems Claude cannot check an agent's state without also ingesting its output. The orchestrator started forgetting earlier results and producing incomplete reports.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;run_in_background: true&lt;/code&gt; + filesystem polling. The orchestrator's context went from O(n * output_size) to O(n * filename). The agents write their results to disk and the orchestrator only reads file paths. Specifically, when a scanner agent finishes, it writes &lt;code&gt;data/scans/reddit/2026-02-17-run1.json&lt;/code&gt;. If it fails, it writes &lt;code&gt;data/scans/reddit/2026-02-17-run1.error&lt;/code&gt;. The orchestrator polls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$ELAPSED&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="nv"&gt;$TIMEOUT&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;SUCCESS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; data/scans/&lt;span class="k"&gt;*&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TODAY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-run&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;.json 2&amp;gt;/dev/null | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;ERRORS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; data/scans/&lt;span class="k"&gt;*&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TODAY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-run&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;.error 2&amp;gt;/dev/null | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;SUCCESS &lt;span class="o"&gt;+&lt;/span&gt; ERRORS&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ge&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXPECTED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.json&lt;/code&gt; means success. &lt;code&gt;.error&lt;/code&gt; means failure. &lt;code&gt;ls&lt;/code&gt; is the health check. This is not elegant.&lt;/p&gt;
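&lt;p&gt;The same poll in Python, for readers who prefer it - a sketch with our own names, equivalent to the shell loop above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from pathlib import Path

def wait_for_scanners(scan_dir, today, expected, poll_seconds=10, max_polls=60):
    """Poll until every scanner has written a .json or .error file, or give up."""
    root = Path(scan_dir)
    for _ in range(max_polls):
        successes = sorted(root.glob(f"*/{today}-run*.json"))
        errors = sorted(root.glob(f"*/{today}-run*.error"))
        done = len(successes) + len(errors)
        if min(done, expected) == expected:  # i.e. at least `expected` results landed
            break
        time.sleep(poll_seconds)
    return successes, errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;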

&lt;h2&gt;
  
  
  Handling Timeouts
&lt;/h2&gt;

&lt;p&gt;Agents will run forever if you let them, and given how imprecise and informal this setup is, you need to impose limits. There is nothing especially interesting here, but for completeness, we implemented this in four layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: &lt;code&gt;max_turns&lt;/code&gt; per agent.&lt;/strong&gt; A hard limit on API round-trips. When a news-finder hits 30 turns, Claude Code stops it and returns whatever it has. We tuned these empirically - 30 was too few for news-finder, 20 was right for dataset-finder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Wall-clock cap per phase.&lt;/strong&gt; 10 minutes. If a batch of agents hasn't finished, move on with whatever completed. Mark the stragglers as "timeout" in the report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Bash &lt;code&gt;timeout 10800&lt;/code&gt;.&lt;/strong&gt; After 3 hours, a second Claude wakes up to salvage partial results (see &lt;a href="https://dev.to/blog/claude-code-kubernetes-cronjob"&gt;Post 1&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: activeDeadlineSeconds.&lt;/strong&gt; And finally the hard limit enforced by Kubernetes, set to 4 hours in our case.&lt;/p&gt;
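&lt;p&gt;Layers 3 and 4 both live in the Kubernetes manifest. A minimal CronJob fragment might look like this - the names, image, and script are placeholders; only the schedule and deadline fields carry the point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: scan-and-classify            # name assumed
spec:
  schedule: "0 8 * * 1-5"            # weekdays 08:00 UTC
  jobTemplate:
    spec:
      activeDeadlineSeconds: 14400   # Layer 4: Kubernetes hard limit, 4 hours
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pipeline
              image: claude-code-runner:latest   # image name assumed
              command: ["bash", "-c", "timeout 10800 ./run.sh"]   # Layer 3: 3-hour bash timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;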

&lt;h2&gt;
  
  
  Debugging Experience Is Bad
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this. Debugging a Claude Code pipeline is pretty subpar. There are no breakpoints or stack traces. When a subagent fails silently, you see nothing - you just notice a &lt;code&gt;.error&lt;/code&gt; file appeared, if you remembered to implement &lt;code&gt;.error&lt;/code&gt; files in the first place (we didn't).&lt;/p&gt;

&lt;p&gt;And since there's no formal verification of any of it - no tests for the pipeline, no type checking on the DAG - you end up interacting with the system &lt;em&gt;through&lt;/em&gt; Claude Code, because a human genuinely cannot handle the throughput of many parallel scanners writing enriched JSON. It's vibecoding at its finest.&lt;/p&gt;

&lt;p&gt;Also, the &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;? The flag is named that for a reason, and there's no way to guarantee that Claude Code won't - to paraphrase a famous Haskell tutorial - go outside and scratch your car with a potato. We run it in an ephemeral container with limited credentials to reduce the blast radius. But if you're someone who needs formal guarantees about what your code will do at runtime, this approach should give you hives. We acknowledge the associated risks and tradeoffs - and given what this pipeline does, the stakes are low.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Quirks From the Skill File
&lt;/h2&gt;

&lt;p&gt;And of course, writing pipelines in English produces some quirks you'd never see in a traditional codebase. Here are three picks from ours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one-sentence retry policy.&lt;/strong&gt; This is the entire fallback logic for when enrichment fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;If Python enrichment returns fewer opportunities than scanned,
use WebFetch for the failed URLs.
Add successful results with "enrichment_method": "webfetch".
Log URLs that fail both methods.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Prefect, that's a custom retry handler with conditional logic. Here it's a paragraph and Claude figures it out.&lt;/p&gt;
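&lt;p&gt;For a sense of scale, here is roughly what that paragraph expands to when written out as ordinary Python (the function names are hypothetical - nothing like this exists in the repo; Claude improvises the behavior from the English):&lt;/p&gt;

```python
def enrich_all(urls, enrich_python, enrich_webfetch, log):
    """Enrich every URL, falling back from the Python path to WebFetch,
    and logging URLs that fail both - mirroring the one-paragraph policy."""
    results, failed = [], []
    for url in urls:
        item = enrich_python(url)
        if item is None:
            # Python enrichment failed: try the WebFetch fallback and tag it.
            item = enrich_webfetch(url)
            if item is not None:
                item["enrichment_method"] = "webfetch"
        if item is None:
            # Both methods failed: log and carry on with the rest.
            log(f"enrichment failed for {url}")
            failed.append(url)
        else:
            results.append(item)
    return results, failed
```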

&lt;p&gt;&lt;strong&gt;The anti-coding instruction.&lt;/strong&gt; The classifier agent's instructions include this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;At no point should you Write() a Python script. If you think you
need one, it's because you misunderstood these instructions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We added this after a classifier tried to write a sentiment analysis script instead of just... reading the thread and thinking about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "never fully fail" rule.&lt;/strong&gt; The last section of the skill file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Scanner fails: Log failure, continue with others
&lt;span class="p"&gt;-&lt;/span&gt; Python enrichment fails: Try WebFetch fallback, then continue
&lt;span class="p"&gt;-&lt;/span&gt; Classifier fails: Log failure, continue with other sources
&lt;span class="p"&gt;-&lt;/span&gt; Proposer fails: Log failure, keep intermediate files

Never fail the entire skill due to individual component failures.
Always produce a pipeline report.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is doing a lot of work. Even a completely botched run produces a report saying "everything broke" - which is still more useful than a silent crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Is the UI
&lt;/h2&gt;

&lt;p&gt;I genuinely enjoy nerding out about subagent orchestration and filesystem-based message buses as much as anyone. But &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt; is a startup that needs to survive, and the cool architecture means nothing if it doesn't change something in the real world beyond the boundaries of the company. Our pipeline mostly generates reports, and it also needs some state, like which URLs it has already seen. The natural move would be a database: define a schema, set up credentials, ... Instead, we use GitHub as the store (together with LFS).&lt;/p&gt;

&lt;p&gt;Every pipeline run creates a PR with a markdown report and the necessary state files, like a &lt;code&gt;seen.txt&lt;/code&gt; of already-listed URLs. Our non-technical person opens it, reads the results, expands a draft response, tweaks a sentence, and responds to an opportunity. Or they open the news pipeline PR, pick the better graphic from two variations, download the PNG, and share it. GitHub is the database, the UI, and the delivery mechanism.&lt;/p&gt;
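&lt;p&gt;The &lt;code&gt;seen.txt&lt;/code&gt; pattern is about as simple as state management gets. A sketch of the idea in Python (the helper name is hypothetical - in practice Claude just reads and appends the file):&lt;/p&gt;

```python
from pathlib import Path

def filter_unseen(urls, seen_file: Path):
    """Drop URLs already recorded in seen.txt and append the new ones,
    so the next pipeline run skips them."""
    seen = set(seen_file.read_text().splitlines()) if seen_file.exists() else set()
    new = [u for u in urls if u not in seen]
    if new:
        with seen_file.open("a") as f:
            f.write("\n".join(new) + "\n")
    return new
```

&lt;p&gt;Because the file travels in the PR, the state is reviewable and revertable like everything else.&lt;/p&gt;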

&lt;p&gt;This bridges the gap between what an AI can do and a human can do - neither on their own is as good as both together. The AI produces 80% of the work, the human fixes the last hardest 20% and takes action, and together they ship something neither could individually.&lt;/p&gt;

&lt;p&gt;The flexibility matters more than the reliability here. When the classifier catches that a Reddit post is actually a competitor's marketing campaign, that's judgment no &lt;code&gt;@task(retries=3)&lt;/code&gt; gives you. When news breaks about tariffs and the pipeline routes to QuantGov for regulatory data, that's not something you hard-code in a DAG. And when your boss can read the pipeline definition and say "add Snowflake to the sources" and it just works because the instruction is in English - that's the point.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We build &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt; - forecast, score, classify, or research every row of a dataset. This pipeline is how we find the people who need it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tooling</category>
      <category>kubernetes</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Running Claude Code as a Kubernetes Job</title>
      <dc:creator>Daniel Hnyk</dc:creator>
      <pubDate>Fri, 27 Feb 2026 12:03:32 +0000</pubDate>
      <link>https://dev.to/hnykda/running-claude-code-as-a-kubernetes-job-25d1</link>
      <guid>https://dev.to/hnykda/running-claude-code-as-a-kubernetes-job-25d1</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of a series on using Claude Code as a production runtime. Originally published on &lt;a href="https://everyrow.io/blog/claude-code-kubernetes-cronjob" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We run Claude Code in Kubernetes for a set of long-running marketing CronJobs. One scans communities like subreddits and support forums, another searches for news and generates relevant content, and the last one optimizes SEO for &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt;, our data processing product.&lt;/p&gt;

&lt;p&gt;This originally sounded like a terrible idea, but after running it for a few months, we think it's a genuinely valid engineering approach - for the right kind of work. Everything is a tradeoff, and this series is a short journey through the practical engineering, actual use cases, and some beautiful metaphysics.&lt;/p&gt;

&lt;p&gt;Our infrastructure for &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt; and &lt;a href="https://futuresearch.ai/" rel="noopener noreferrer"&gt;futuresearch.ai&lt;/a&gt; runs on Google Kubernetes Engine, so that's where we'll start - here's what you need to make Claude Code work as a K8s CronJob, gotchas included.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;For reasons explained in the next posts, we need both Python and Node. Claude is excellent at writing Python glue code (Python has been preparing for this time all its life), and we write in Python as well. Whenever Claude produces something useful for itself, we ask it to add it to the &lt;code&gt;lib&lt;/code&gt; module for future reference. More on that later.&lt;/p&gt;

&lt;p&gt;We put together a minimal runnable example at &lt;a href="https://github.com/futuresearch/example-cc-cronjob" rel="noopener noreferrer"&gt;github.com/futuresearch/example-cc-cronjob&lt;/a&gt; - a Dockerfile, entrypoint, a trivial skill, and both a plain CronJob manifest and a Helm chart. Everything below is from our production setup, but if you just want to get something running, start there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;All right, let's start with a pretty standard Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build stage: install Python dependencies with uv&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ghcr.io/astral-sh/uv:python3.13-bookworm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; pyproject.toml uv.lock ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--no-sources&lt;/span&gt;

&lt;span class="c"&gt;# Runtime: Python + Node.js (Claude CLI needs Node)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nikolaik/python-nodejs:python3.13-nodejs22&lt;/span&gt;

&lt;span class="c"&gt;# jq for our "monitoring stack", librsvg2-bin for SVG→PNG, gh for PR creation&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; jq librsvg2-bin git-lfs gh &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /bin/bash claudie
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; claudie&lt;/span&gt;

&lt;span class="c"&gt;# Install Claude CLI as non-root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://claude.ai/install.sh | bash

&lt;span class="c"&gt;# Skip the interactive onboarding. Claude CLI won't start without this.&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"hasCompletedOnboarding": true}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /home/claudie/.claude.json

&lt;span class="c"&gt;# Copy venv from build stage, copy project files, set PATH&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /app/.venv /home/claudie/.venv&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /home/claudie/claudie&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; deploy/entrypoint.sh /home/claudie/entrypoint.sh&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; claudie:claudie /home/claudie
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; claudie&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/home/claudie/.venv/bin:/home/claudie/.local/bin:$PATH"&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["/home/claudie/entrypoint.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of things to notice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We use a multi-stage build, installing Python dependencies in one stage and copying the venv into the runtime image - not strictly necessary, but a nice space optimization.&lt;/li&gt;
&lt;li&gt;Claude Code requires Node.js - it's a Node app under the hood, hence the &lt;code&gt;python-nodejs&lt;/code&gt; base image.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hasCompletedOnboarding&lt;/code&gt; line: without it, Claude tries to walk you through an interactive setup wizard. Since this runs in a container without a TTY, that's obviously not what you want, hence this mini-hack.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Entrypoint
&lt;/h2&gt;

&lt;p&gt;The entrypoint is where you set up prerequisites for your workflow - credentials for MCP servers, SSH keys, and so on. In our case, one of the more important ones is &lt;code&gt;gh&lt;/code&gt; (GitHub CLI), since we use GitHub as the place to store results and create PRs (more on that in the later posts).&lt;/p&gt;

&lt;p&gt;The actual Claude Code process is spawned like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--verbose&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-format&lt;/span&gt; stream-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SKILL_PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's unpack this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;-p&lt;/code&gt; simply means non-interactive mode.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; is what it sounds like - the agent can do whatever it wants. We appreciate this is controversial and that sysadmins are screaming somewhere, but empirically, we haven't seen anything bad happen with the tasks we run.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--verbose&lt;/code&gt; together with &lt;code&gt;--output-format stream-json&lt;/code&gt; gets the output out of Claude Code. By default, it only outputs the final message and you have no visibility into what it's doing. These two parameters make sure everything gets logged to &lt;code&gt;stdout&lt;/code&gt;. There is a &lt;em&gt;lot&lt;/em&gt; of detail - see the next section for filtering.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--&lt;/code&gt; separator before the prompt is important if you use &lt;code&gt;--add-dir&lt;/code&gt;. Without it, the prompt gets consumed as another directory path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;SKILL_PROMPT&lt;/code&gt; is literally something like &lt;code&gt;execute scan-and-classify skill&lt;/code&gt;, optionally with &lt;code&gt;--add-dir &amp;lt;some-path&amp;gt;&lt;/code&gt; if you need additional directories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Filtering logs with &lt;code&gt;jq&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;When Claude runs with &lt;code&gt;--output-format stream-json --verbose&lt;/code&gt;, you get one JSON object per line - every thought, every tool call, every result... You'll want to filter this to something more sensible. We pipe it to &lt;code&gt;jq&lt;/code&gt; and by trial and error found the following to be a sensible tradeoff between verbosity and volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude ... | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RAW_LOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;--unbuffered&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'
if .type == "assistant" then
  .message.content[]? |
  if .type == "text" then "&amp;gt;&amp;gt;&amp;gt; " + .text[0:5000]
  elif .type == "tool_use" then "[" + .name + "] " + ((.input | tostring)[0:3000])
  else empty end
elif .type == "result" then
  "[done] " + (.result // "complete")[0:5000]
else empty end'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt; for Claude's thoughts. &lt;code&gt;[Read]&lt;/code&gt; or &lt;code&gt;[Bash]&lt;/code&gt; for tool calls. &lt;code&gt;[done]&lt;/code&gt; for completion.&lt;/p&gt;

&lt;p&gt;The raw JSONL goes to &lt;code&gt;/tmp/&lt;/code&gt; for when you need to debug.&lt;/p&gt;
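&lt;p&gt;When you do need the raw JSONL, the same filtering is easy to replicate in Python for ad-hoc digging. A sketch, assuming the event shapes the &lt;code&gt;jq&lt;/code&gt; filter above relies on:&lt;/p&gt;

```python
import json

def summarize_stream(jsonl_text: str):
    """Reduce Claude Code's stream-json output to the same one-line summaries
    the jq filter produces: thoughts, tool calls, and the final result."""
    lines = []
    for raw in jsonl_text.splitlines():
        event = json.loads(raw)
        if event.get("type") == "assistant":
            for block in event["message"].get("content", []):
                if block.get("type") == "text":
                    lines.append(">>> " + block["text"][:5000])
                elif block.get("type") == "tool_use":
                    lines.append("[" + block["name"] + "] "
                                 + json.dumps(block["input"])[:3000])
        elif event.get("type") == "result":
            lines.append("[done] " + (event.get("result") or "complete")[:5000])
    return lines
```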

&lt;h2&gt;
  
  
  Timeout - The Safety Net
&lt;/h2&gt;

&lt;p&gt;If you open the example entrypoint in the repository, you'll notice we wrap the execution with &lt;code&gt;timeout 10800 bash -c 'claude ...'&lt;/code&gt;. Why isn't the Kubernetes job's &lt;code&gt;activeDeadlineSeconds&lt;/code&gt; enough? Because when that fires, the whole pod is simply killed and nothing gets salvaged - we want a catch-all inside the pod if things go wrong. Three hours (10800 seconds) is the timeout for just the Claude Code part. If Claude hangs - and it will, eventually - &lt;code&gt;timeout&lt;/code&gt; kills it with exit code 124, and then a second Claude instance wakes up to collect whatever was created so far for debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CLAUDE_EXIT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 124 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;timeout &lt;/span&gt;600 claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"The pipeline timed out. Check what partial results exist.
       Write a report. Commit to a branch. Create a PR with [PARTIAL] prefix."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So... the CronJob spawns backup Claudes to clean up after a failed Claude. Not sure if this is robust engineering or a cry for help (both?), but it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CronJob
&lt;/h2&gt;

&lt;p&gt;The CronJob manifest is relatively simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claudie-scan-classify&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5"&lt;/span&gt;          &lt;span class="c1"&gt;# 8am UTC weekdays&lt;/span&gt;
  &lt;span class="na"&gt;concurrencyPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Forbid&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backoffLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;activeDeadlineSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14400&lt;/span&gt;  &lt;span class="c1"&gt;# 4 hours - longer than the Claude timeout&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claudie&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your container registry&amp;gt;/claudie:latest&lt;/span&gt;
              &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SKILL_NAME&lt;/span&gt;
                  &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scan-and-classify"&lt;/span&gt;
              &lt;span class="na"&gt;envFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claudie-secrets&lt;/span&gt;
              &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
                  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
                &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
                  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole thing. &lt;code&gt;SKILL_NAME&lt;/code&gt; tells the entrypoint which skill to run. &lt;code&gt;concurrencyPolicy: Forbid&lt;/code&gt; prevents overlap. Secrets go in via &lt;code&gt;envFrom&lt;/code&gt; - the Anthropic API key, GitHub token, and whatever MCP servers need. We have three of these (scan, news, SEO) with different schedules. We wrap this in a lightweight Helm template, so adding a new skill is just an entry in &lt;code&gt;values.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily-news&lt;/span&gt;
    &lt;span class="na"&gt;skillName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily-news-content&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5"&lt;/span&gt;  &lt;span class="c1"&gt;# Weekdays only (Mon-Fri)&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan-classify&lt;/span&gt;
    &lt;span class="na"&gt;skillName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan-and-classify&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5"&lt;/span&gt;  &lt;span class="c1"&gt;# Weekdays only (Mon-Fri)&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;seo-pipeline&lt;/span&gt;
    &lt;span class="na"&gt;skillName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;seo-pipeline&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1,3,5"&lt;/span&gt;  &lt;span class="c1"&gt;# Mon/Wed/Fri at 10:00 UTC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GitHub as a Database
&lt;/h2&gt;

&lt;p&gt;One pattern worth calling out: we use GitHub as our entire storage and delivery layer. Every pipeline run creates a branch, commits results, pushes, and opens a PR. The PR is the output - our cofounder opens it, reads a markdown report, and acts on it. There's no database, no dashboard, no custom UI. Much more on this in the later posts.&lt;/p&gt;

&lt;p&gt;To make this work from a container, the entrypoint sets up git and the GitHub CLI before Claude starts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.email &lt;span class="s2"&gt;"claudie-bot@example.com"&lt;/span&gt;
git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.name &lt;span class="s2"&gt;"Claudie Bot"&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.ssh
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SSH_PRIVATE_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.ssh/id_ed25519
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.ssh/id_ed25519
ssh-keyscan github.com &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.ssh/known_hosts 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SSH_PRIVATE_KEY&lt;/code&gt; is a deploy key with write access to the repo. &lt;code&gt;GH_TOKEN&lt;/code&gt; (passed as an env var) lets &lt;code&gt;gh&lt;/code&gt; create PRs. Both go into the Kubernetes secret. The skill then just tells Claude to commit and create a PR - it knows how to use &lt;code&gt;git&lt;/code&gt; and &lt;code&gt;gh&lt;/code&gt; out of the box.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/futuresearch/example-cc-cronjob" rel="noopener noreferrer"&gt;example repo&lt;/a&gt; demonstrates this: the &lt;code&gt;add-numbers&lt;/code&gt; skill computes a result, writes it to a file, commits to a branch, and opens a PR. A toy example, but it's the same pattern our production pipelines use every day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Do This?
&lt;/h2&gt;

&lt;p&gt;Probably not for anything important. I would resign if we used this for a payment pipeline. But for discovering that someone on &lt;code&gt;r/salesforce&lt;/code&gt; needs help deduplicating 5000 company records? Take my money.&lt;/p&gt;

&lt;p&gt;The next post covers what actually runs inside these CronJobs - specifically, why a 398-line markdown file replaced what would normally be a non-trivial orchestration job.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We build &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow.io&lt;/a&gt; - tools for semantic deduplication, entity resolution, and qualitative ranking of datasets. This pipeline is how we find people who need them.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://everyrow.io/blog/claude-code-workflow-engine" rel="noopener noreferrer"&gt;Using Claude Code as a Workflow Engine&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
    </item>
    <item>
      <title>5 DataFrame Operations LLMs Handle Better Than Code</title>
      <dc:creator>Daniel Hnyk</dc:creator>
      <pubDate>Thu, 19 Feb 2026 09:21:01 +0000</pubDate>
      <link>https://dev.to/hnykda/5-dataframe-operations-llms-handle-better-than-code-436a</link>
      <guid>https://dev.to/hnykda/5-dataframe-operations-llms-handle-better-than-code-436a</guid>
      <description>&lt;p&gt;There are things I do with DataFrames all the time that pandas was never built for. Filtering by subjective criteria. Joining tables that don't share a key. Looking up information that only exists on the web. Recently I've been using LLMs, and the results have been surprisingly cheap and accurate.&lt;/p&gt;

&lt;p&gt;Here are five operations I now handle with LLMs (with working code).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Filter by Qualitative Criteria
&lt;/h2&gt;

&lt;p&gt;You have 3,616 job postings and want only the ones that are remote-friendly, senior-level, AND disclose salary. &lt;code&gt;df[df['posting'].str.contains('remote')]&lt;/code&gt; matches "No remote work available."&lt;/p&gt;
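&lt;p&gt;The trap is easy to reproduce even without pandas - substring matching has no notion of negation:&lt;/p&gt;

```python
postings = [
    "Fully remote, Senior Data Engineer, $180k-$210k",
    "No remote work available. On-site only.",
]
# Naive keyword filter: both postings "match", including the explicit rejection.
matches = [p for p in postings if "remote" in p.lower()]
assert len(matches) == 2  # the second match is a false positive
```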

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $4.24 for 3,616 rows (9.9 minutes)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;screen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JobScreenResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;qualifies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True if meets ALL criteria&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;screen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A job posting qualifies if it meets ALL THREE criteria:
    1. Remote-friendly: Explicitly allows remote work
    2. Senior-level: Title contains Senior/Staff/Lead/Principal
    3. Salary disclosed: Specific compensation numbers mentioned
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;JobScreenResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;216 of 3,616 passed (6%). Interestingly, the pass rate has climbed from 1.7% in 2020 to 14.5% in 2025 as more companies are offering remote work and disclosing salaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://everyrow.io/docs/filter-dataframe-with-llm" rel="noopener noreferrer"&gt;Full guide with dataset&lt;/a&gt; · See it applied to real job postings: &lt;a href="https://everyrow.io/docs/case-studies/screen-job-postings-by-criteria" rel="noopener noreferrer"&gt;Screening job postings by criteria&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Classify Rows Into Categories
&lt;/h2&gt;

&lt;p&gt;You need to label 200 job postings into categories (backend, frontend, data, ML/AI, devops, etc.). Keyword matching misses anything that's not an exact match, but training a classifier is overkill for a one-off task like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $1.74 for 200 rows (2.1 minutes). At scale: ~$9 for 1,000 rows, ~$90 for 10,000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agent_map&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JobClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frontend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fullstack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml_ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops_sre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mobile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Primary role category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why this category was chosen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;agent_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify this job posting by primary role...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;JobClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Literal&lt;/code&gt; type constrains the LLM to your predefined set, so there's no post-processing needed. You can add confidence scores and multi-label support by extending the Pydantic model.&lt;/p&gt;
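&lt;p&gt;For example, extending the model could look like this — a sketch, where &lt;code&gt;confidence&lt;/code&gt; and &lt;code&gt;secondary_categories&lt;/code&gt; are field names I'm inventing for illustration:&lt;/p&gt;

```python
from typing import Literal
from pydantic import BaseModel, Field

Category = Literal[
    "backend", "frontend", "fullstack", "data",
    "ml_ai", "devops_sre", "mobile", "security", "other",
]

class JobClassification(BaseModel):
    category: Category = Field(description="Primary role category")
    # Extensions: a bounded confidence score and optional extra labels
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence from 0 to 1")
    secondary_categories: list[Category] = Field(
        default_factory=list, description="Other categories that also apply"
    )
    reasoning: str = Field(description="Why this category was chosen")
```

Pass it as `response_model` exactly as before; the extra fields come back as extra columns.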

&lt;p&gt;&lt;a href="https://everyrow.io/docs/classify-dataframe-rows-llm" rel="noopener noreferrer"&gt;Full guide with dataset&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Add a Column Using Web Research
&lt;/h2&gt;

&lt;p&gt;You have a list of 246 SaaS products and need the annual price of each one's lowest paid tier. There's no API for this kind of problem because it requires visiting pricing pages that all present information differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $6.68 for 246 rows (15.7 minutes), 99.6% success rate&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agent_map&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PricingInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lowest_paid_tier_annual_price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Annual price in USD for the lowest paid tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tier_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name of the tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;agent_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Find the pricing for this SaaS product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s lowest paid tier.
    Visit the product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s pricing page.
    Report the annual price in USD and the tier name.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PricingInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result comes with a &lt;code&gt;research&lt;/code&gt; column showing how the agent found the answer, with citations. For example, Slack's entry references slack.com/pricing/pro and shows the math: $7.25/month × 12 = $87/year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://everyrow.io/docs/add-column-web-lookup" rel="noopener noreferrer"&gt;Full guide with dataset&lt;/a&gt; · See it applied to vendor matching: &lt;a href="https://everyrow.io/docs/case-studies/match-software-vendors-to-requirements" rel="noopener noreferrer"&gt;Matching software vendors to requirements&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Join DataFrames Without a Shared Key
&lt;/h2&gt;

&lt;p&gt;You have two tables of S&amp;amp;P 500 data — one with company names and market caps, the other with stock tickers and fair values. Without a shared column across both datasets, &lt;code&gt;pd.merge()&lt;/code&gt; is useless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $1.00 for 438 rows (~30 seconds), 100% accuracy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Match companies to their stock tickers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;left_table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# has: company, price, mkt_cap
&lt;/span&gt;    &lt;span class="n"&gt;right_table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;valuations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# has: ticker, fair_value
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 3M → MMM, Alphabet Inc. → GOOGL, etc.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, it uses a cascade: exact match → fuzzy match → LLM reasoning → web search. In this run, 99.8% of rows matched via the LLM alone. Even with 10% character-level noise injected ("Alphaeet Iqc." instead of "Alphabet Inc."), it still hit 100% accuracy at $0.44. I'd much rather manually review a few unmatched rows than chase false positives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://everyrow.io/docs/fuzzy-join-without-keys" rel="noopener noreferrer"&gt;Full guide with dataset&lt;/a&gt; · See it applied at scale: &lt;a href="https://everyrow.io/docs/case-studies/llm-powered-merging-at-scale" rel="noopener noreferrer"&gt;LLM-powered merging at scale&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Rank by a Metric That's Not in Your Data
&lt;/h2&gt;

&lt;p&gt;You have 300 PyPI packages and want to rank them by days since last release and number of GitHub contributors. This data is on PyPI and GitHub (not in your DataFrame).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $3.90 for days-since-release, $4.13 for GitHub contributors (300 rows each, ~5 minutes)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;everyrow.ops&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank by number of days since the last PyPI release&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_since_release&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK sends a web research agent per row to look up the metric, then ranks by the result. And it works for any metric you can describe in natural language, as long as it's findable on the web.&lt;/p&gt;
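&lt;p&gt;Once the metric column exists, the ranking itself is ordinary pandas. A sketch with invented release dates, just to show the arithmetic that happens after the lookups:&lt;/p&gt;

```python
from datetime import date
import pandas as pd

today = date(2026, 3, 20)  # pinned so the example is reproducible
packages = pd.DataFrame({
    "package": ["requests", "abandonedlib", "flask"],  # invented examples
    "last_release": [date(2026, 1, 10), date(2021, 6, 1), date(2025, 11, 3)],
})

packages["days_since_release"] = [(today - d).days for d in packages["last_release"]]
ranked = packages.sort_values("days_since_release", ascending=False, ignore_index=True)
print(ranked["package"].tolist())  # stalest first
```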

&lt;p&gt;&lt;a href="https://everyrow.io/docs/rank-by-external-metric" rel="noopener noreferrer"&gt;Full guide with dataset&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Filter job postings&lt;/td&gt;
&lt;td&gt;3,616&lt;/td&gt;
&lt;td&gt;$4.24&lt;/td&gt;
&lt;td&gt;9.9 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify into categories&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;2.1 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web research (pricing)&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;td&gt;$6.68&lt;/td&gt;
&lt;td&gt;15.7 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fuzzy join (no key)&lt;/td&gt;
&lt;td&gt;438&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;30 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rank by external metric&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;$3.90&lt;/td&gt;
&lt;td&gt;4.3 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of these are one function call on a pandas DataFrame. The orchestration (batching, parallelism, retries, rate limiting, model selection) is handled by &lt;a href="https://everyrow.io" rel="noopener noreferrer"&gt;everyrow&lt;/a&gt;, an open-source Python SDK. New accounts get $20 in free credit, which covers all five examples above with room to spare.&lt;/p&gt;
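&lt;p&gt;For a sense of what that orchestration replaces, here's the kind of boilerplate you'd otherwise hand-roll — a minimal concurrency-cap-plus-retry sketch, not everyrow's actual internals:&lt;/p&gt;

```python
import asyncio

async def call_with_retry(row_id: int, sem: asyncio.Semaphore, attempts: int = 3) -> str:
    async with sem:  # cap concurrent in-flight LLM calls
        for attempt in range(attempts):
            try:
                # Stand-in for a real LLM call; every fifth row fails once
                if attempt == 0 and row_id % 5 == 0:
                    raise TimeoutError("simulated transient failure")
                return f"result-{row_id}"
            except TimeoutError:
                await asyncio.sleep(0.01 * 2 ** attempt)  # exponential backoff
        raise RuntimeError(f"row {row_id} failed after {attempts} attempts")

async def main() -> list[str]:
    sem = asyncio.Semaphore(8)
    return await asyncio.gather(*(call_with_retry(i, sem) for i in range(20)))

results = asyncio.run(main())
print(len(results))  # 20 — every row succeeded, some after a retry
```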

&lt;p&gt;The full code and datasets for each example are linked above.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
