Eric Young

Posted on Jun 1

Prompting Is Not Enough: Code-Enforced Research Workflows for AI Agents

#ai #claude #agents #opensource

Most AI workflow failures do not happen because the prompt is too short.

They happen because the prompt is the only thing holding the process together.

In long research tasks, especially business research, the model can start well and still drift later:

It summarizes before verifying.
It treats weak sources as if they were primary evidence.
It updates a conclusion but forgets to update the chart or table behind it.
It cites a source that cites another source, then presents the second-hand claim as if it were original.
It becomes overconfident when the evidence is thin.
It skips the boring quality-control step when the context gets long.

This is why I built Alpha Insights as a harness-enforced research workflow instead of a large prompt template.

Alpha Insights is an open-source business research skill for Claude Code and Codex Desktop. It packages consulting-style research into a staged workflow with frameworks, evidence grading, validators, and report generation.

The more interesting part is not the list of frameworks. It is the execution model.

The Core Problem

Prompting is probabilistic: an instruction in the prompt raises the odds that a step happens, but it does not guarantee it.

You can ask the model to check sources, reconcile numbers, red-team its assumptions, and maintain chart consistency. Sometimes it will. Sometimes it will quietly skip the step, especially after the task becomes long and messy.

For casual work, that may be fine.

For research, it is not fine. A report can look polished while hiding weak evidence, stale numbers, mismatched charts, or unsupported conclusions.

So the design question changes:

Instead of asking "How do I write a better prompt?", ask:

What artifact must exist before the workflow advances?
Which claims need source confidence?
Which checks should be deterministic?
Which failure modes should block the next stage?
What should be written to disk so the workflow can survive context drift?

That is the difference between a prompt and a harness.

What Alpha Insights Enforces

Alpha Insights uses the model for reasoning, synthesis, and judgment. But it tries to move repeatable control logic out of the prompt and into the surrounding system.

The workflow includes:

19 business frameworks: Porter's Five Forces, BCG Matrix, PESTEL, TAM/SAM/SOM, JTBD, flywheel, business model canvas, value chain, and more.
9 thinking methods: issue trees, MECE, hypothesis-driven research, pyramid principle, triangulation, first principles, ACH, pre-mortem, and expert-interview logic.
Evidence grading: claims are tagged by source confidence instead of treating all citations as equal.
Stage gates: validators and hooks block progression when required artifacts or checks are missing.
HTML reports: the final output is a decision-ready report with ECharts visualizations.

The important shift is that "do good research" becomes a set of explicit intermediate artifacts.

For example:

A research plan must exist before evidence collection.
Evidence needs source confidence instead of anonymous citation stuffing.
Claims should link back to supporting evidence.
Report headlines should not drift away from chart data.
Weak evidence should not support strong strategic recommendations without warning.

Some of these checks still require judgment. But many failure modes are mechanical enough to catch with code.
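As one illustration of a mechanical check, here is a minimal sketch of the "claims should link back to supporting evidence" rule. It is not taken from the Alpha Insights codebase: the `Evidence` and `Claim` classes, the `find_unsupported_claims` function, and the "high"/"medium"/"low" grades are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    evidence_id: str
    source: str
    confidence: str  # hypothetical grades: "high", "medium", "low"


@dataclass
class Claim:
    text: str
    evidence_ids: list[str] = field(default_factory=list)


def find_unsupported_claims(claims: list[Claim], ledger: list[Evidence]) -> list[str]:
    """Return the text of every claim that cites nothing or cites an unknown ID."""
    known_ids = {e.evidence_id for e in ledger}
    problems = []
    for claim in claims:
        has_dangling_ref = any(i not in known_ids for i in claim.evidence_ids)
        if not claim.evidence_ids or has_dangling_ref:
            problems.append(claim.text)
    return problems


ledger = [Evidence("E1", "Company annual report", "high")]
claims = [
    Claim("Revenue grew year over year.", ["E1"]),
    Claim("The market will double by 2030.", []),           # no evidence at all
    Claim("Churn is below the industry average.", ["E9"]),  # dangling reference
]
unsupported = find_unsupported_claims(claims, ledger)
print(unsupported)
```

Whether the cited evidence actually supports the claim is still a judgment call for the model or a human reviewer. The code only guarantees that the link exists.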

What Should Be Code, Not Prompt

After building and iterating on the workflow, I now think several AI-agent failure modes should be treated as engineering problems:

Stale numbers

If a number changes in one part of the report, downstream tables, charts, and executive summaries should not silently keep the old value.
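One way to make this checkable, sketched under assumptions: keep a single canonical metrics store, have each report section record the value it rendered, and diff the two. The function name `find_stale_references`, the metric key, and the section names below are hypothetical and are not taken from Alpha Insights.

```python
def find_stale_references(metrics: dict, sections: dict) -> list[tuple]:
    """Compare the values each section rendered against the canonical store.

    metrics:  {metric_key: current_value}
    sections: {section_name: {metric_key: value_rendered_in_that_section}}
    Returns (section, key, rendered_value, current_value) for every mismatch.
    """
    stale = []
    for section_name, used in sections.items():
        for key, rendered_value in used.items():
            current = metrics.get(key)  # None if the metric was removed
            if current != rendered_value:
                stale.append((section_name, key, rendered_value, current))
    return stale


# The market size was revised to 12.4, but the summary still shows 11.8.
metrics = {"market_size_usd_bn": 12.4}
sections = {
    "executive_summary": {"market_size_usd_bn": 11.8},
    "market_chart": {"market_size_usd_bn": 12.4},
}
stale = find_stale_references(metrics, sections)
print(stale)
```

The exact-equality comparison is deliberate in this sketch: a section either rendered the current value or it did not.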

Source laundering

If source A cites source B, the system should not pretend A is the primary source. The claim should preserve the evidence chain.
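A small data-structure sketch of "preserve the evidence chain", assuming each source record can point at the source it got the claim from. The `Source` class and its `cites` field are hypothetical names for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    name: str
    cites: Optional["Source"] = None  # where this source got the claim, if anywhere


def evidence_chain(source: Source) -> list[str]:
    """Walk from the cited source back to the origin of the claim."""
    chain = []
    current: Optional[Source] = source
    while current is not None:
        chain.append(current.name)
        current = current.cites
    return chain


def is_primary(source: Source) -> bool:
    """A source is primary for a claim only if it cites nothing further."""
    return source.cites is None


survey = Source("Original survey")
blog = Source("Industry blog", cites=survey)

chain = evidence_chain(blog)
print(chain)             # ['Industry blog', 'Original survey']
print(is_primary(blog))  # False
```

With the chain stored explicitly, a validator can refuse to label the blog post as primary evidence, and a report can show the full path instead of only the last hop.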

Chart/report mismatch

If a chart says 42% and the paragraph says 47%, that should be a validation issue, not a writing style issue.
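A deliberately crude sketch of such a validation: pull every percentage out of the paragraph and flag any that does not appear in the chart's data series. The function name `find_percent_mismatches` is hypothetical, and the check assumes the paragraph and the chart describe the same series, which a real harness would have to establish separately.

```python
import re


def find_percent_mismatches(paragraph: str, chart_values: list[float]) -> list[float]:
    """Return percentages mentioned in the text that are absent from the chart data."""
    mentioned = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", paragraph)]
    allowed = set(chart_values)
    return [value for value in mentioned if value not in allowed]


paragraph = "Adoption reached 47% among mid-market buyers."
chart_values = [42.0, 58.0]

mismatches = find_percent_mismatches(paragraph, chart_values)
print(mismatches)  # [47.0]
```

A non-empty result does not say which side is wrong. It only says the text and the chart disagree, which is enough to block the stage until someone reconciles them.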

Skipped artifacts

If the workflow requires a plan, an evidence ledger, a red-team pass, or a report-quality check, the system should verify that the artifact exists before moving on.
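The simplest version of this gate is a file-existence check. The sketch below assumes artifacts are written to a working directory. The stage names, the file names, and the `missing_artifacts` function are hypothetical and do not reflect the actual Alpha Insights layout.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping: which files must exist before a stage may start.
REQUIRED_ARTIFACTS = {
    "collect_evidence": ["research_plan.md"],
    "write_report": ["research_plan.md", "evidence_ledger.json", "red_team.md"],
}


def missing_artifacts(stage: str, workdir: Path) -> list[str]:
    """Return the required files for `stage` that are not present in `workdir`."""
    return [name for name in REQUIRED_ARTIFACTS[stage] if not (workdir / name).is_file()]


# Demo in a throwaway directory: only the research plan has been written.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "research_plan.md").write_text("# Plan\n")
    blocking_collection = missing_artifacts("collect_evidence", workdir)
    blocking_report = missing_artifacts("write_report", workdir)

print(blocking_collection)  # []
print(blocking_report)      # ['evidence_ledger.json', 'red_team.md']
```

Existence is a low bar, since an empty file would pass. It still catches the most common failure, which is the step being skipped entirely.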

Overconfidence from weak evidence

If a claim is supported only by low-confidence sources, the language should not become definitive without an explicit warning.
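A rough heuristic for this, offered as a sketch rather than a real linguistic analysis: flag a claim when every supporting source is graded "low" and the wording contains definitive terms. The word list, the grade labels, and the `needs_confidence_warning` function are all assumptions made for this example.

```python
import re

# Hypothetical list of words that signal definitive language.
DEFINITIVE_WORDS = {"will", "definitely", "clearly", "proves", "certainly"}


def needs_confidence_warning(claim_text: str, evidence_grades: list[str]) -> bool:
    """True when a claim uses definitive wording but rests only on low-grade evidence."""
    only_weak = bool(evidence_grades) and all(g == "low" for g in evidence_grades)
    words = set(re.findall(r"[a-z]+", claim_text.lower()))
    return only_weak and bool(words & DEFINITIVE_WORDS)


strong_on_weak = needs_confidence_warning(
    "This segment will clearly dominate by 2027.", ["low", "low"]
)
hedged_on_weak = needs_confidence_warning("This segment may grow.", ["low"])
strong_on_mixed = needs_confidence_warning(
    "This segment will dominate.", ["high", "low"]
)
print(strong_on_weak, hedged_on_weak, strong_on_mixed)  # True False False
```

A keyword list will miss some overconfident phrasing and flag some harmless sentences. The point is that the harness raises the question mechanically, so the model or a reviewer has to answer it.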

These are exactly the kinds of things prompts are bad at enforcing over long sessions.

Harness Engineering

The pattern I am exploring is what I call harness engineering:

Use prompts to describe intent.

Use code, state machines, hooks, validators, and explicit files to enforce the workflow.

The model is still doing the hard thinking. But the system around it decides whether the work is complete enough to advance.

That boundary matters.

If everything lives in the prompt, the model is both the worker and the inspector. In long workflows, that is fragile.

If the harness owns the process, the model can focus on reasoning while the system checks structure, evidence, and completion.
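To make the worker/inspector split concrete, here is a minimal state-machine sketch in which the harness, not the model, decides whether a stage may advance. The `Harness` class, the stage names, and the validator functions are hypothetical. This illustrates the pattern and is not the Alpha Insights implementation.

```python
from typing import Callable

# A validator inspects the workflow state and returns a list of problems.
# An empty list means the stage is complete.
Validator = Callable[[dict], list]


class Harness:
    def __init__(self, stages: list[tuple[str, Validator]]):
        self.stages = stages
        self.index = 0

    @property
    def current_stage(self) -> str:
        return self.stages[self.index][0]

    def advance(self, state: dict) -> list:
        """Run the current stage's validator; move on only if it finds no problems."""
        _, validate = self.stages[self.index]
        problems = validate(state)
        if not problems and self.index < len(self.stages) - 1:
            self.index += 1
        return problems


def plan_exists(state: dict) -> list:
    return [] if state.get("plan") else ["research plan is missing"]


def evidence_is_graded(state: dict) -> list:
    evidence = state.get("evidence", [])
    if not evidence:
        return ["no evidence collected"]
    return [f"evidence {e['id']} has no confidence grade"
            for e in evidence if "confidence" not in e]


harness = Harness([("plan", plan_exists),
                   ("evidence", evidence_is_graded),
                   ("report", lambda state: [])])

blocked = harness.advance({})                # the model skipped the plan
stage_after_block = harness.current_stage    # still "plan"
passed = harness.advance({"plan": "Scope: mid-market CRM"})
stage_after_pass = harness.current_stage     # now "evidence"
print(blocked, stage_after_block, passed, stage_after_pass)
```

The model can produce whatever it likes inside a stage, but only the validators determine when the stage is over.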

Why This Matters

AI agents are getting better at producing plausible work.

That makes verification more important, not less.

For business research, the goal is not a longer report. The goal is a report where the reasoning chain is visible, the evidence quality is explicit, and the workflow cannot quietly skip the boring parts.

Alpha Insights is one implementation of that idea.

GitHub: https://github.com/Ericyoung-183/alpha-insights

Demo report: https://ericyoung-183.github.io/alpha-insights/assets/demo-report.html

MIT licensed. Feedback is very welcome, especially from people building agent workflows where the boundary between model judgment and deterministic enforcement is still unclear.