Lars Winstand

Posted on Jun 28 • Originally published at standardcompute.com

I watched someone burn 50 hours on OpenClaw and the fix was embarrassingly simple

#ai #automation #devops #productivity

I knew this was worth writing about the second I saw a post on r/openclaw from someone who said they had spent 50 hours trying to automate freelancer scouting, evaluation, and outreach in one OpenClaw loop.

That sentence tells you exactly what happened.

Not beginner confusion. Not "AI is useless" rage.

It’s the much more dangerous state: just enough progress to believe the whole thing is one more prompt away.

I’ve seen this over and over in:

lead qualification automation
recruiter workflows
outbound prospecting
freelancer scouting
agent-based enrichment pipelines

The pattern is always the same:

Search for people
Evaluate them
Personalize outreach
Send messages
Hope one giant agent can do all of it cleanly

And then the workflow turns into soup.

The fix is not exotic.

Stop building one giant agent. Build a staged workflow.

Use OpenClaw for judgment. Use n8n or Make for orchestration. Use scripts and APIs for deterministic steps. Then make the LLM calls cheap enough that you can actually afford to test them properly.

The core mistake: treating OpenClaw like a whole ops team

OpenClaw is useful. But people keep asking the wrong thing of it.

It makes sense as an assistant harness:

stateful sessions
memory
tool use
model routing
provider failover

That is powerful.

But that does not mean you should hand it one giant business process and expect reliable end-to-end execution.

If your workflow is:

search the web for freelancers
filter by quality
rank candidates
write personalized DMs
send those DMs
log outcomes

...you do not have one task.

You have multiple systems with different failure modes pretending to be one task.

That distinction matters a lot.

OpenClaw itself kind of tells you this

If you look at the CLI shape, the mindset is obvious:

openclaw status
openclaw status --all
openclaw status --deep
openclaw logs --follow
openclaw doctor

That is not "trust the magic" UX.

That is inspect, debug, iterate UX.

Which is exactly how you should approach agent workflows.

Why the one-big-agent design fails

Because scouting, evaluation, and outreach are not one problem.

They are three different bugs wearing a trench coat.

1) Search fails before the model even starts thinking

A lot of bad agent workflows are just bad inputs with extra steps.

If your sourcing layer is weak, the model spends all its intelligence grading junk.

That’s why I agree with the people recommending Exa for search-heavy agent loops. Better retrieval quality matters more than people think.

For this class of workflow, search is not a helper. Search is the foundation.

A practical sourcing step looks more like this:

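// sketch: assumes the exa-js client and an EXA_API_KEY environment variable;
// check the current Exa docs for exact option names
import Exa from "exa-js"

const exa = new Exa(process.env.EXA_API_KEY)

// searchAndContents returns page text alongside each search result
const candidates = await exa.searchAndContents(
  "freelance product designer SaaS portfolio",
  { numResults: 25, text: true }
)

// keep only the fields the scoring step needs
const normalized = candidates.results.map((item) => ({
  name: item.author || null,
  title: item.title,
  url: item.url,
  snippet: item.text?.slice(0, 500) || ""
}))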

That is already better than asking one giant prompt to both discover and judge candidates in the same breath.
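Search results tend to repeat the same person under slightly different URLs, so it helps to deduplicate in code before the model sees anything. Here is a minimal sketch. `canonicalUrl` and `dedupeCandidates` are hypothetical helper names, and dropping query strings is an assumption that may not suit every site.

```javascript
// Hypothetical helper: reduce a URL to host + path so near-duplicates collapse.
function canonicalUrl(rawUrl) {
  const url = new URL(rawUrl); // throws on unparseable input
  const host = url.hostname.replace(/^www\./, "");
  const path = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  // Assumption: query strings and fragments are tracking noise here.
  return `${host}${path}`;
}

// Keep the first record seen for each canonical URL.
// Records with a missing or unparseable URL are dropped.
function dedupeCandidates(records) {
  const seen = new Map();
  for (const record of records) {
    if (!record.url) continue;
    let key;
    try {
      key = canonicalUrl(record.url);
    } catch {
      continue;
    }
    if (!seen.has(key)) seen.set(key, record);
  }
  return [...seen.values()];
}
```

Running this over the `normalized` array means a duplicate costs nothing instead of costing a model call.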

2) Evaluation needs repetition, not vibes

This is where most agent demos fall apart in production.

You cannot validate a scoring prompt by watching it succeed once.

You need to run it on:

20 candidates
then 50
then edge cases
then obvious rejects
then ambiguous profiles

Then compare outputs.

Then tweak the rubric.

Then run it again.

That’s why n8n is useful here. Its evaluation workflow pattern is much closer to how engineers should test LLM logic.
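Whatever tool runs the loop, the comparison step itself is small. The sketch below assumes you have hand-labeled each candidate with an expected decision and collected the model's decisions by id. `evaluateDecisions` and the field names are illustrative, not part of n8n or OpenClaw.

```javascript
// labeled: [{ id, expected }] from your hand-labeled rows
// predictions: { [id]: "reject" | "review" | "approve" } from the model
function evaluateDecisions(labeled, predictions) {
  const mismatches = [];
  let correct = 0;
  for (const { id, expected } of labeled) {
    // A row the model never answered counts as a mismatch.
    const actual = predictions[id] ?? "missing";
    if (actual === expected) {
      correct += 1;
    } else {
      mismatches.push({ id, expected, actual });
    }
  }
  return {
    total: labeled.length,
    accuracy: labeled.length === 0 ? 0 : correct / labeled.length,
    mismatches,
  };
}
```

The `mismatches` list is the useful part: it tells you which profiles to read before you touch the rubric.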

A simple scoring loop

Put 25 rows in Google Sheets or a table:

| Candidate | Portfolio | Notes |
|----------|----------|----------|
| A | https://example.com/a | Strong UI, weak case studies |
| B | https://example.com/b | Great B2B work |
| C | https://example.com/c | Mostly student projects |

Then score each row with a small, explicit prompt.

{
  "task": "score_freelancer",
  "criteria": {
    "relevant_experience": 0.4,
    "portfolio_quality": 0.3,
    "communication_clarity": 0.2,
    "risk_flags": 0.1
  },
  "output_format": {
    "score": "0-100",
    "reason": "short explanation",
    "decision": "reject|review|approve"
  }
}
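One way to use those weights is to have the model return a 0-100 subscore per criterion and do the arithmetic in code, so the final number is reproducible. This is a sketch under assumptions: the function names are made up, the 75 and 50 cutoffs are placeholders to tune, and `risk_flags` is treated as "higher is safer".

```javascript
const WEIGHTS = {
  relevant_experience: 0.4,
  portfolio_quality: 0.3,
  communication_clarity: 0.2,
  risk_flags: 0.1,
};

// subscores: one 0-100 rating per criterion, higher is better.
// Assumption: for risk_flags, 100 means no red flags were found.
function weightedScore(subscores, weights = WEIGHTS) {
  let total = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    const value = subscores[criterion];
    if (!Number.isFinite(value) || value < 0 || value > 100) {
      throw new RangeError(`Missing or out-of-range subscore: ${criterion}`);
    }
    total += weight * value;
  }
  return Math.round(total);
}

// Assumed cutoffs; adjust them against your labeled rows.
function decide(score, { approveAt = 75, reviewAt = 50 } = {}) {
  if (score >= approveAt) return "approve";
  if (score >= reviewAt) return "review";
  return "reject";
}
```

Keeping the weighting outside the prompt also means you can change a weight and re-rank a batch without paying for new model calls.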

And call the model in a narrow way.

const prompt = `
Score this freelancer for outbound outreach.

Candidate data:
${JSON.stringify(candidate, null, 2)}

Return JSON only:
{
  "score": number,
  "reason": string,
  "decision": "reject" | "review" | "approve"
}
`

That’s the work.

Not the flashy DM generation. The scoring loop.
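A scoring loop is only as trustworthy as its parsing. "Return JSON only" is a request, not a guarantee, so it is worth validating every response before it reaches a spreadsheet or CRM. A minimal sketch, with `parseScoreResponse` as a hypothetical name:

```javascript
const DECISIONS = ["reject", "review", "approve"];

// Returns { ok: true, value } or { ok: false, error } instead of throwing,
// so a batch run can record the bad row and keep going.
function parseScoreResponse(rawText) {
  let data;
  try {
    data = JSON.parse(rawText);
  } catch (err) {
    return { ok: false, error: `invalid JSON: ${err.message}` };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { score, reason, decision } = data;
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    return { ok: false, error: "score must be a number from 0 to 100" };
  }
  if (typeof reason !== "string" || reason.trim() === "") {
    return { ok: false, error: "reason must be a non-empty string" };
  }
  if (!DECISIONS.includes(decision)) {
    return { ok: false, error: `unknown decision: ${String(decision)}` };
  }
  return { ok: true, value: { score, reason: reason.trim(), decision } };
}
```

Counting how often `ok` is false across a batch is itself a cheap health metric for the prompt.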

3) Outreach is deterministic in all the annoying places

This is the part developers usually know already, but ignore because the autonomous-agent fantasy is fun.

Sending email, writing to Airtable, creating a HubSpot contact, updating Notion, posting to Slack, writing to Postgres — these are not LLM problems.

They are API problems.

So solve them with:

n8n
Make
direct scripts
native APIs
Composio
CRM connectors

Let the model decide what to say.

Do not let the model own the mechanics of how the message gets sent unless you enjoy debugging weird side effects.

Example split:

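// gmail, hubspot, and generatePersonalizedDM are placeholders
// for your own clients and helpers
async function contactCandidate(candidate) {
  // deterministic gate: only approved candidates get a message
  if (candidate.decision !== "approve") {
    return
  }

  // the model decides what to say
  const message = await generatePersonalizedDM(candidate)

  // plain API calls decide how it gets sent and logged
  await gmail.send({
    to: candidate.email,
    subject: "Quick question about freelance work",
    body: message
  })

  await hubspot.contacts.createOrUpdate({
    email: candidate.email,
    score: candidate.score,
    source: "exa-search"
  })
}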

That architecture is less magical.

It is also much easier to debug at 2 AM.
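The same idea extends to deciding who gets contacted at all. The sketch below builds an outreach plan as plain data before anything is sent: approved candidates go to `send`, borderline ones go to a human `review` queue, and anyone already contacted is skipped. `planOutreach` and the `alreadyContacted` set are assumptions for illustration; in practice that set would come from your CRM or log table.

```javascript
// candidates: [{ email, decision, ... }]
// alreadyContacted: Set of lowercase emails you have messaged before
function planOutreach(candidates, alreadyContacted = new Set()) {
  const plan = { send: [], review: [], skip: [] };
  const queued = new Set(); // guards against duplicates within this batch
  for (const candidate of candidates) {
    const email = (candidate.email || "").trim().toLowerCase();
    if (candidate.decision === "review") {
      plan.review.push(candidate);
    } else if (
      candidate.decision === "approve" &&
      email !== "" &&
      !alreadyContacted.has(email) &&
      !queued.has(email)
    ) {
      queued.add(email);
      plan.send.push(candidate);
    } else {
      plan.skip.push(candidate);
    }
  }
  return plan;
}
```

Because the plan is just an object, you can log it, diff it, or show it to a human before a single email leaves the building.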

The workflow that actually survives contact with reality

If I were building freelancer scouting or lead qualification automation today, I’d structure it like this:

Source candidates with Exa or a deterministic scraper/API
Normalize records with code
Score candidates with a narrow LLM prompt
Run evals on a labeled dataset
Require approval for edge cases or high-value outreach
Generate personalization only for approved candidates
Send messages via Gmail, LinkedIn helpers, or CRM integrations
Log outcomes for prompt iteration later

That sounds slower than one giant agent.

In practice, it’s faster.

Because every broken part has a name.
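To make "every broken part has a name" concrete, here is a minimal sketch of a stage runner. It is not how n8n or Make work internally; `runStages` is a hypothetical helper that passes each stage's output to the next and reports which named stage failed.

```javascript
// stages: [{ name, run }] where run is a sync or async function
// that receives the previous stage's output.
async function runStages(stages, input) {
  let value = input;
  const trace = [];
  for (const { name, run } of stages) {
    const startedAt = Date.now();
    try {
      value = await run(value);
      trace.push({ stage: name, ok: true, ms: Date.now() - startedAt });
    } catch (err) {
      trace.push({
        stage: name,
        ok: false,
        ms: Date.now() - startedAt,
        error: err.message,
      });
      // Stop at the first failure and say exactly where it happened.
      return { ok: false, failedStage: name, trace };
    }
  }
  return { ok: true, value, trace };
}
```

When a run fails you get `failedStage: "score"` rather than a vague sense that the agent got confused somewhere.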

One giant agent vs staged workflow

| Approach | What actually happens |
|----------|----------|
| One big agent workflow | One prompt tries to scout, evaluate, personalize, and send. Fast to demo, miserable to debug, and every failure compounds. |
| Staged n8n or Make workflow | Separate steps for sourcing, scoring, approval, and outreach. Easier to test, easier to swap tools, easier to reason about. |
| Deterministic scripts plus agent judgment | APIs and scripts handle repeatable actions. LLMs handle ranking, extraction, and personalization. Best option for production reliability. |

If you plan to run the workflow more than a few times, the staged version wins.

Not because it’s elegant.

Because it’s survivable.

The cost problem nobody likes admitting

Here’s the part that quietly wrecks architecture decisions:

Retries cost money.

Every time you:

re-score a batch
retry extraction
run fallback prompts
compare models
evaluate 50 examples
regenerate outreach

...the meter runs.
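It helps to put a rough number on that meter. The sketch below is back-of-the-envelope arithmetic only. Every figure you pass in (token counts, prices) is your own assumption, and `estimateEvalCost` is a hypothetical helper, not any provider's calculator.

```javascript
// Estimate per-token spend for repeated eval runs.
// Prices are in USD per million tokens; plug in your provider's real rates.
function estimateEvalCost({
  rows,
  runs,
  inputTokensPerCall,
  outputTokensPerCall,
  usdPerMillionInput,
  usdPerMillionOutput,
}) {
  const calls = rows * runs;
  const inputCost = ((calls * inputTokensPerCall) / 1e6) * usdPerMillionInput;
  const outputCost = ((calls * outputTokensPerCall) / 1e6) * usdPerMillionOutput;
  return { calls, usd: Number((inputCost + outputCost).toFixed(2)) };
}

// Placeholder inputs, not real prices: 50 rows re-scored 20 times.
const estimate = estimateEvalCost({
  rows: 50,
  runs: 20,
  inputTokensPerCall: 1500,
  outputTokensPerCall: 150,
  usdPerMillionInput: 3,
  usdPerMillionOutput: 15,
});
```

The absolute number matters less than the shape: cost scales with rows times runs, which is exactly the product that good testing wants to increase.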

And that creates bad behavior.

Teams start doing things they know are wrong because per-token pricing makes iteration feel expensive:

avoiding evals
under-testing prompts
keeping giant prompts instead of splitting steps
skipping retries
refusing to batch experiments

That’s how token anxiety becomes a design constraint.

For agent workflows, that is brutal.

Because the correct architecture usually involves more calls, smaller steps, more testing.

Which is exactly why flat-rate compute is so attractive for automation-heavy stacks.

If you’re running agents all day in n8n, Make, Zapier, OpenClaw, or custom workflows, predictable pricing changes behavior in a good way. You stop treating every iteration like a financial decision.

That’s the real appeal of Standard Compute.

It gives you an OpenAI-compatible API, but with unlimited AI compute at a flat monthly price instead of per-token billing. So the architecture you’d build if cost didn’t constantly nag you — staged workflows, repeated evals, lots of narrow model calls — becomes practical.

That matters a lot when your workflow is doing repeated ranking, scoring, rewriting, and fallback logic.
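"OpenAI-compatible" in practice means the scoring call is the usual chat completions request pointed at a different base URL. A minimal sketch of building that request follows. The base URL, model name, and `buildScoringRequest` helper are placeholders, so check your provider's docs for the real values.

```javascript
// Build a chat-completions request for any OpenAI-compatible endpoint.
// baseUrl and model are assumptions supplied by the caller.
function buildScoringRequest({ baseUrl, apiKey, model, prompt }) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        temperature: 0, // keep repeated scoring runs as stable as possible
      }),
    },
  };
}

// Usage (inside an async function):
//   const { url, options } = buildScoringRequest({ baseUrl, apiKey, model, prompt })
//   const response = await fetch(url, options)
//   const text = (await response.json()).choices[0].message.content
```

Separating request construction from the network call also makes the scoring step easy to unit test without spending anything.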

The boring architecture is usually the advanced one

People think the sophisticated setup is maximum autonomy.

Usually it isn’t.

Usually the sophisticated setup is:

OpenClaw for judgment and tool use
n8n for orchestration
Exa for search
Gmail or HubSpot connectors for delivery
a smaller model for repeated scoring
a larger model only when needed

That is not less advanced.

That is more advanced because it respects failure boundaries.
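The "larger model only when needed" part can be a few lines of routing. In this sketch, `scoreSmall` and `scoreLarge` are hypothetical async functions wrapping your two model calls, and the 40-70 ambiguity band is an assumption to tune against your eval set.

```javascript
// Score with the small model first; escalate only when the score
// lands inside the ambiguous band.
async function scoreWithEscalation(
  candidate,
  { scoreSmall, scoreLarge, band = [40, 70] }
) {
  const [low, high] = band;
  const first = await scoreSmall(candidate);
  const ambiguous = first.score >= low && first.score <= high;
  if (!ambiguous) {
    return { ...first, model: "small" };
  }
  const second = await scoreLarge(candidate);
  return { ...second, model: "large" };
}
```

Logging the `model` field per row shows how often escalation actually fires, which tells you whether the band is too wide.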

What I would build first

Not outreach.

That’s the bait.

Build the scoring loop first.

Start here

Collect 25 candidate profiles
Put them in Google Sheets or a database table
Define a scoring rubric
Run the same prompt across all rows
Compare outputs
Fix the rubric
Repeat
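For the "compare outputs" step, a small diff between two runs goes a long way. The sketch below assumes each run is an array of `{ id, score, decision }` rows; `diffRuns` is a hypothetical helper that lists the candidates whose decision flipped after a rubric change.

```javascript
// before / after: arrays of { id, score, decision } from two scoring runs.
// Returns only the rows whose decision changed between runs.
function diffRuns(before, after) {
  const previous = new Map(before.map((row) => [row.id, row]));
  const changes = [];
  for (const row of after) {
    const old = previous.get(row.id);
    if (!old) continue; // new rows have nothing to compare against
    if (old.decision !== row.decision) {
      changes.push({
        id: row.id,
        from: old.decision,
        to: row.decision,
        scoreDelta: row.score - old.score,
      });
    }
  }
  return changes;
}
```

If a rubric tweak flips more decisions than you expected, read those rows before running it on anything real.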

A minimal prompt is enough:

You are scoring freelancer candidates for outbound outreach.

Evaluate this candidate on:
- relevant experience
- evidence of quality work
- communication clarity
- fit for B2B SaaS work

Return JSON with:
- score (0-100)
- decision (reject/review/approve)
- reason (max 30 words)

Then, if you’re using OpenClaw, actually inspect the system while it runs:

openclaw status --deep
openclaw logs --follow
openclaw doctor

If you can’t explain why a candidate was selected, you are not ready to automate outreach.

That sounds harsh.

It is still cheaper than losing another 50 hours to a workflow that looked smart in a diagram.

The real lesson

The lesson is not that OpenClaw is weak.

The lesson is that people keep trying to compress messy human workflows into one heroic prompt.

That almost never works.

Break it apart.

Let LLMs judge.

Let scripts execute.

Let n8n or Make orchestrate.

And if you’re doing enough repeated LLM work that token pricing is warping your design, use infrastructure that doesn’t punish iteration.

That’s how these workflows stop being demos and start becoming systems.

DEV Community