Lars Winstand

Originally published at standardcompute.com

I watched someone burn 50 hours on OpenClaw and the fix was embarrassingly simple

I knew this was worth writing the second I saw a post on r/openclaw from someone who said they spent 50 hours trying to automate freelancer scouting, evaluation, and outreach in one OpenClaw loop.

That sentence tells you exactly what happened.

Not beginner confusion. Not "AI is useless" rage.

It’s the much more dangerous state: just enough progress to believe the whole thing is one more prompt away.

I’ve seen this over and over in:

  • lead qualification automation
  • recruiter workflows
  • outbound prospecting
  • freelancer scouting
  • agent-based enrichment pipelines

The pattern is always the same:

  1. Search for people
  2. Evaluate them
  3. Personalize outreach
  4. Send messages
  5. Hope one giant agent can do all of it cleanly

And then the workflow turns into soup.

The fix is not exotic.

Stop building one giant agent. Build a staged workflow.

Use OpenClaw for judgment. Use n8n or Make for orchestration. Use scripts and APIs for deterministic steps. Then make the LLM calls cheap enough that you can actually afford to test them properly.

The core mistake: treating OpenClaw like a whole ops team

OpenClaw is useful. But people keep asking the wrong thing of it.

It makes sense as an assistant harness:

  • stateful sessions
  • memory
  • tool use
  • model routing
  • provider failover

That is powerful.

But that does not mean you should hand it one giant business process and expect reliable end-to-end execution.

If your workflow is:

  • search the web for freelancers
  • filter by quality
  • rank candidates
  • write personalized DMs
  • send those DMs
  • log outcomes

...you do not have one task.

You have multiple systems with different failure modes pretending to be one task.

That distinction matters a lot.

OpenClaw itself kind of tells you this

If you look at the CLI shape, the mindset is obvious:

openclaw status
openclaw status --all
openclaw status --deep
openclaw logs --follow
openclaw doctor

That is not "trust the magic" UX.

That is inspect, debug, iterate UX.

Which is exactly how you should approach agent workflows.

Why the one-big-agent design fails

Because scouting, evaluation, and outreach are not one problem.

They are three different bugs wearing a trench coat.

1) Search fails before the model even starts thinking

A lot of bad agent workflows are just bad inputs with extra steps.

If your sourcing layer is weak, the model spends all its intelligence grading junk.

That’s why I agree with the people recommending Exa for search-heavy agent loops. Better retrieval quality matters more than people think.

For this class of workflow, search is not a helper. Search is the foundation.

A practical sourcing step looks more like this:

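// Sourcing sketch using the exa-js SDK; check the call against your SDK version
import Exa from "exa-js"

const exa = new Exa(process.env.EXA_API_KEY)

// searchAndContents returns page text along with each result
const candidates = await exa.searchAndContents(
  "freelance product designer SaaS portfolio",
  { numResults: 25, text: true }
)

// Keep only the fields the scoring step needs
const normalized = candidates.results.map((item) => ({
  name: item.author || null,
  title: item.title,
  url: item.url,
  snippet: item.text?.slice(0, 500) || ""
}))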

That is already better than asking one giant prompt to both discover and judge candidates in the same breath.
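
Normalization is also where cheap deterministic cleanup belongs. Here is a minimal sketch, assuming records shaped like the `normalized` objects above. `cleanCandidates` and its URL-matching rule are illustrative choices, not part of Exa or OpenClaw:

```javascript
// Hypothetical helper: drop records the scorer cannot use and collapse duplicates.
function cleanCandidates(records) {
  const seen = new Set()
  const cleaned = []
  for (const record of records) {
    // No URL or no text means there is nothing to score.
    if (!record.url || !record.snippet) continue
    let key
    try {
      const parsed = new URL(record.url)
      // Treat www/non-www and trailing-slash variants as the same page.
      key = (parsed.hostname.replace(/^www\./, "") + parsed.pathname.replace(/\/+$/, "")).toLowerCase()
    } catch {
      continue // not a valid absolute URL
    }
    if (seen.has(key)) continue
    seen.add(key)
    cleaned.push(record)
  }
  return cleaned
}
```

Running this before scoring means the model never spends a call on an empty snippet or on the same portfolio twice.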

2) Evaluation needs repetition, not vibes

This is where most agent demos fall apart in production.

You cannot validate a scoring prompt by watching it succeed once.

You need to run it on:

  • 20 candidates
  • then 50
  • then edge cases
  • then obvious rejects
  • then ambiguous profiles

Then compare outputs.

Then tweak the rubric.

Then run it again.

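That’s why n8n is useful here. Its evaluation pattern, running the same workflow over a dataset of test rows and comparing the outputs, is much closer to how engineers should test LLM logic than eyeballing a single run.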
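
If you want the same idea without any tooling, the comparison step can be sketched in a few lines. This assumes you have hand-labeled each candidate with the decision you expect. `evaluateDecisions` is a hypothetical helper, and the labels below are made up for illustration:

```javascript
// Hypothetical eval helper: compare model decisions with human labels.
function evaluateDecisions(labeled, predictions) {
  const mismatches = []
  let matched = 0
  for (const row of labeled) {
    const predicted = predictions[row.id]
    if (predicted === row.expected) {
      matched += 1
    } else {
      mismatches.push({ id: row.id, expected: row.expected, predicted: predicted ?? "missing" })
    }
  }
  return {
    total: labeled.length,
    agreement: labeled.length === 0 ? 0 : matched / labeled.length,
    mismatches
  }
}

// Illustrative labels and model outputs, not real data.
const labeled = [
  { id: "A", expected: "review" },
  { id: "B", expected: "approve" },
  { id: "C", expected: "reject" }
]
const predictions = { A: "approve", B: "approve", C: "reject" }

const report = evaluateDecisions(labeled, predictions)
console.log(report.agreement.toFixed(2), report.mismatches)
```

The agreement number matters less than the mismatch list. That list is what you read before touching the rubric.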

A simple scoring loop

Put 25 rows in Google Sheets or a table:

| Candidate | Portfolio | Notes |
|-----------|-----------|-------|
| A | https://example.com/a | Strong UI, weak case studies |
| B | https://example.com/b | Great B2B work |
| C | https://example.com/c | Mostly student projects |

Then score each row with a small, explicit prompt.

{
  "task": "score_freelancer",
  "criteria": {
    "relevant_experience": 0.4,
    "portfolio_quality": 0.3,
    "communication_clarity": 0.2,
    "risk_flags": 0.1
  },
  "output_format": {
    "score": "0-100",
    "reason": "short explanation",
    "decision": "reject|review|approve"
  }
}

And call the model in a narrow way.

const prompt = `
Score this freelancer for outbound outreach.

Candidate data:
${JSON.stringify(candidate, null, 2)}

Return JSON only:
{
  "score": number,
  "reason": string,
  "decision": "reject" | "review" | "approve"
}
`

That’s the work.

Not the flashy DM generation. The scoring loop.
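
Narrow prompts still return junk sometimes, so check the JSON before anything downstream trusts it. Here is a minimal sketch, assuming the model response arrives as a raw string. `parseScoreResponse` is a hypothetical helper, and the 0-100 range comes from the rubric above:

```javascript
const DECISIONS = new Set(["reject", "review", "approve"])

// Hypothetical validator for the JSON shape requested in the scoring prompt.
function parseScoreResponse(rawText) {
  let data
  try {
    data = JSON.parse(rawText)
  } catch {
    return { ok: false, error: "not valid JSON" }
  }
  if (data === null || typeof data !== "object") {
    return { ok: false, error: "not a JSON object" }
  }
  if (!Number.isFinite(data.score) || data.score < 0 || data.score > 100) {
    return { ok: false, error: "score must be a number from 0 to 100" }
  }
  if (typeof data.reason !== "string" || data.reason.trim() === "") {
    return { ok: false, error: "reason must be a non-empty string" }
  }
  if (!DECISIONS.has(data.decision)) {
    return { ok: false, error: "decision must be reject, review, or approve" }
  }
  return {
    ok: true,
    value: { score: data.score, reason: data.reason.trim(), decision: data.decision }
  }
}
```

Returning an error object instead of throwing keeps a bad row from killing the whole batch. You can count failures per run and treat a rising count as a prompt bug.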

3) Outreach is deterministic in all the annoying places

This is the part developers usually know already, but ignore because the autonomous-agent fantasy is fun.

Sending email, writing to Airtable, creating a HubSpot contact, updating Notion, posting to Slack, writing to Postgres — these are not LLM problems.

They are API problems.

So solve them with:

  • n8n
  • Make
  • direct scripts
  • native APIs
  • Composio
  • CRM connectors

Let the model decide what to say.

Do not let the model own the mechanics of how the message gets sent unless you enjoy debugging weird side effects.

Example split:

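// gmail and hubspot stand in for whatever API clients or connectors you use
async function handleCandidate(candidate) {
  // Deterministic gate: only approved candidates get outreach
  if (candidate.decision !== "approve") {
    return
  }

  // The model decides what to say
  const message = await generatePersonalizedDM(candidate)

  // Plain API calls own how it gets sent and logged
  await gmail.send({
    to: candidate.email,
    subject: "Quick question about freelance work",
    body: message
  })

  await hubspot.contacts.createOrUpdate({
    email: candidate.email,
    score: candidate.score,
    source: "exa-search"
  })
}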

That architecture is less magical.

It is also much easier to debug at 2 AM.

The workflow that actually survives contact with reality

If I were building freelancer scouting or lead qualification automation today, I’d structure it like this:

  1. Source candidates with Exa or a deterministic scraper/API
  2. Normalize records with code
  3. Score candidates with a narrow LLM prompt
  4. Run evals on a labeled dataset
  5. Require approval for edge cases or high-value outreach
  6. Generate personalization only for approved candidates
  7. Send messages via Gmail, LinkedIn helpers, or CRM integrations
  8. Log outcomes for prompt iteration later

That sounds slower than one giant agent.

In practice, it’s faster.

Because every broken part has a name.
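
Here is a rough sketch of what "has a name" can look like in code. `runStages` is a hypothetical runner, and the stages are stubs standing in for the real sourcing, scoring, approval, and sending steps:

```javascript
// Hypothetical runner: each stage is a named async function over the batch.
async function runStages(stages, input) {
  let data = input
  const trace = []
  for (const { name, run } of stages) {
    try {
      data = await run(data)
      trace.push({ stage: name, ok: true, count: Array.isArray(data) ? data.length : null })
    } catch (err) {
      // Stop at the first failure and report exactly which stage broke.
      trace.push({ stage: name, ok: false, error: err.message })
      return { ok: false, failedStage: name, trace }
    }
  }
  return { ok: true, data, trace }
}

// Stub stages; real ones would call Exa, the scoring prompt, and so on.
const stages = [
  { name: "source", run: async () => [{ url: "https://example.com/a" }, { url: "https://example.com/b" }] },
  { name: "score", run: async (rows) => rows.map((r, i) => ({ ...r, decision: i === 0 ? "approve" : "reject" })) },
  { name: "approve", run: async (rows) => rows.filter((r) => r.decision === "approve") },
  { name: "send", run: async () => { throw new Error("mail API returned 429") } }
]

runStages(stages, null).then((result) => console.log(result.failedStage, result.trace))
```

The trace records how many records survived each stage, so "the agent did something weird" becomes "scoring passed 2 rows, approval kept 1, send failed".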

One giant agent vs staged workflow

| Approach | What actually happens |
|----------|-----------------------|
| One big agent workflow | One prompt tries to scout, evaluate, personalize, and send. Fast to demo, miserable to debug, and every failure compounds. |
| Staged n8n or Make workflow | Separate steps for sourcing, scoring, approval, and outreach. Easier to test, easier to swap tools, easier to reason about. |
| Deterministic scripts plus agent judgment | APIs and scripts handle repeatable actions. LLMs handle ranking, extraction, and personalization. Best option for production reliability. |

If you plan to run the workflow more than a few times, the staged version wins.

Not because it’s elegant.

Because it’s survivable.

The cost problem nobody likes admitting

Here’s the part that quietly wrecks architecture decisions:

retries cost money.

Every time you:

  • re-score a batch
  • retry extraction
  • run fallback prompts
  • compare models
  • evaluate 50 examples
  • regenerate outreach

...the meter runs.
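
The multiplication is easy to underestimate, so here is the back-of-envelope version. Every number below is a placeholder, not a measurement or a real price. `estimateEvalVolume` and `perTokenCost` are hypothetical helpers:

```javascript
// Placeholder volumes; substitute your own batch sizes and measured token counts.
function estimateEvalVolume({ candidates, promptVariants, runsPerVariant, tokensPerCall }) {
  const calls = candidates * promptVariants * runsPerVariant
  return { calls, tokens: calls * tokensPerCall }
}

// Cost under per-token billing, given a price you look up yourself.
function perTokenCost(tokens, pricePerMillionTokens) {
  return (tokens / 1_000_000) * pricePerMillionTokens
}

const volume = estimateEvalVolume({
  candidates: 50,
  promptVariants: 4,
  runsPerVariant: 3,
  tokensPerCall: 1500
})
console.log(volume)
```

Call count grows with the product of those inputs, so one extra prompt variant or one extra retry pass adds a whole batch of calls, not a single call.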

And that creates bad behavior.

Teams start doing things they know are wrong because per-token pricing makes iteration feel expensive:

  • avoiding evals
  • under-testing prompts
  • keeping giant prompts instead of splitting steps
  • skipping retries
  • refusing to batch experiments

That’s how token anxiety becomes a design constraint.

For agent workflows, that is brutal.

Because the correct architecture usually involves more calls, smaller steps, more testing.

Which is exactly why flat-rate compute is so attractive for automation-heavy stacks.

If you’re running agents all day in n8n, Make, Zapier, OpenClaw, or custom workflows, predictable pricing changes behavior in a good way. You stop treating every iteration like a financial decision.

That’s the real appeal of Standard Compute.

It gives you an OpenAI-compatible API, but with unlimited AI compute at a flat monthly price instead of per-token billing. So the architecture you’d build if cost didn’t constantly nag you — staged workflows, repeated evals, lots of narrow model calls — becomes practical.

That matters a lot when your workflow is doing repeated ranking, scoring, rewriting, and fallback logic.

The boring architecture is usually the advanced one

People think the sophisticated setup is maximum autonomy.

Usually it isn’t.

Usually the sophisticated setup is:

  • OpenClaw for judgment and tool use
  • n8n for orchestration
  • Exa for search
  • Gmail or HubSpot connectors for delivery
  • a smaller model for repeated scoring
  • a larger model only when needed

That is not less advanced.

That is more advanced because it respects failure boundaries.
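
Here is a sketch of the last two bullets, assuming you supply `scoreSmall` and `scoreLarge` yourself as functions that call a cheaper and a more capable model with the same scoring prompt. Escalating on a "review" decision is one possible rule, not the only one:

```javascript
// Hypothetical router: only ambiguous candidates reach the larger model.
async function scoreWithEscalation(candidate, { scoreSmall, scoreLarge }) {
  const first = await scoreSmall(candidate)
  if (first.decision !== "review") {
    // Clear approve or reject: keep the cheap answer.
    return { ...first, model: "small" }
  }
  const second = await scoreLarge(candidate)
  return { ...second, model: "large" }
}
```

Because the two scorers are passed in, you can swap them for stubs in tests and count how often escalation actually fires.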

What I would build first

Not outreach.

That’s the bait.

Build the scoring loop first.

Start here

  1. Collect 25 candidate profiles
  2. Put them in Google Sheets or a database table
  3. Define a scoring rubric
  4. Run the same prompt across all rows
  5. Compare outputs
  6. Fix the rubric
  7. Repeat

A minimal prompt is enough:

You are scoring freelancer candidates for outbound outreach.

Evaluate this candidate on:
- relevant experience
- evidence of quality work
- communication clarity
- fit for B2B SaaS work

Return JSON with:
- score (0-100)
- decision (reject/review/approve)
- reason (max 30 words)
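
For the compare-and-fix part of that loop, it helps to see exactly which candidates changed decision after a rubric edit. This is a minimal sketch, assuming each run is stored as an object keyed by candidate id. `diffRuns` is a hypothetical helper, and the scores are made up:

```javascript
// Hypothetical diff between two scoring runs keyed by candidate id.
function diffRuns(before, after) {
  const flipped = []
  for (const id of Object.keys(before)) {
    if (!(id in after)) continue // candidate not scored in the second run
    if (before[id].decision !== after[id].decision) {
      flipped.push({
        id,
        from: before[id].decision,
        to: after[id].decision,
        scoreDelta: after[id].score - before[id].score
      })
    }
  }
  return flipped
}

// Illustrative runs, not real results.
const before = {
  A: { score: 72, decision: "approve" },
  B: { score: 88, decision: "approve" },
  C: { score: 20, decision: "reject" }
}
const after = {
  A: { score: 55, decision: "review" },
  B: { score: 90, decision: "approve" },
  C: { score: 15, decision: "reject" }
}
console.log(diffRuns(before, after))
```

If a rubric tweak flips candidates you did not mean to touch, you find out here instead of in someone's inbox.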

Then, if you’re using OpenClaw, actually inspect the system while it runs:

openclaw status --deep
openclaw logs --follow
openclaw doctor

If you can’t explain why a candidate was selected, you are not ready to automate outreach.

That sounds harsh.

It is still cheaper than losing another 50 hours to a workflow that looked smart in a diagram.

The real lesson

The lesson is not that OpenClaw is weak.

The lesson is that people keep trying to compress messy human workflows into one heroic prompt.

That almost never works.

Break it apart.

Let LLMs judge.

Let scripts execute.

Let n8n or Make orchestrate.

And if you’re doing enough repeated LLM work that token pricing is warping your design, use infrastructure that doesn’t punish iteration.

That’s how these workflows stop being demos and start becoming systems.
