Lars Winstand

Posted on Jun 27 • Originally published at standardcompute.com

I stopped trusting app dashboards and used a browser automation AI agent to rebuild the numbers from scratch

#ai #automation #devops #productivity

Dashboards are great right up until they quietly lie to you.

I like a clean admin screen as much as anyone. Green checks. Nice totals. A chart drifting upward like the database has never seen a duplicate row in its life.

But some of the worst ops mistakes I’ve seen started with the same sentence:

“The dashboard says we’re fine.”

That’s why a small Reddit example stuck with me. In a thread on r/openclaw, someone said they used OpenClaw to fill out Garmin’s device-sync worksheet from their own activity history instead of trusting the app screen.

That’s a tiny use case. It’s also one of the clearest examples of what AI agents are actually good at.

Not writing tweets.
Not roleplaying as your coworker.
Not summarizing a summary.

The useful move is this:

Have the agent go back to source records and reconstruct the answer itself.

That turns the agent from a chatbot into a verification layer.

And for developers building automations, that’s way more interesting.

The chat part is the least interesting part

Most people still picture an agent as a chat UI with a few tools attached.

That framing misses the real value.

The important part is not that GPT-5 or Claude can answer in natural language. The important part is that an agent can inspect:

Gmail threads
Slack messages
SQLite or PostgreSQL rows
CSV exports
Google Sheets
app activity logs
calendar events

Then it can compare those records to whatever your dashboard claims happened.

That’s the architectural shift.

If the agent can access the underlying records directly, it does not need to trust one app’s summary screen.

For verification workflows, that’s the difference between:

“read the number on the page”
“compute the number from evidence”

I trust the second one a lot more.

Why dashboards are often the wrong source of truth

Dashboards are optimized for readability and speed.

They are not optimized for forensic accuracy.

A dashboard number might be:

cached
delayed
filtered
deduplicated
rounded
based on business rules you forgot existed

That’s fine when you’re checking a rough trend.

It’s not fine when you’re deciding:

whether a customer was contacted
whether a sync job actually completed
whether your CRM matches your inbox
whether support backlog is growing
whether a billing report is safe to send

The Garmin example works because it’s painfully familiar: the app screen said one thing, the history said another, so the user rebuilt the answer from the underlying activity.

That’s the pattern.

Don’t ask AI to trust the dashboard. Ask AI to check the receipts.
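To make the list of failure modes above concrete, here is a minimal sketch of how the same raw rows can produce several defensible totals. The row shape (`id`, `status`) and the function names `candidateCounts` and `explainDashboardValue` are assumptions for illustration, not taken from any particular app.

```javascript
// Hypothetical raw rows: the field names `id` and `status` are assumptions.
function candidateCounts(rows) {
  const completed = rows.filter((row) => row.status === "completed");
  return {
    // Every row, exactly as stored.
    rawCount: rows.length,
    // Deduplicated: each id counted once.
    dedupedCount: new Set(rows.map((row) => row.id)).size,
    // Filtered: a business rule that hides anything not "completed".
    completedCount: completed.length,
    // Both rules applied together.
    dedupedCompletedCount: new Set(completed.map((row) => row.id)).size,
  };
}

// Report which reconstruction, if any, reproduces the dashboard number.
function explainDashboardValue(dashboardValue, rows) {
  const counts = candidateCounts(rows);
  const matches = Object.keys(counts).filter(
    (name) => counts[name] === dashboardValue
  );
  return { dashboardValue, counts, matches };
}
```

If `matches` comes back empty, the dashboard is applying a rule you have not modeled yet, which is worth knowing before anyone acts on the number.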

The stack that makes this work

While digging through agent workflows, I found another r/openclaw discussion that explained the integration problem better than most vendor pages do. One commenter broke it into tiers: native tools, MCP connections, and managed OAuth layers like Composio.

That’s the real design question.

Not “which model is smartest?”

The better question is:

How directly can this agent access the records I actually trust?

Here’s the practical version.

Option	What it’s best at
OpenClaw	Local-first agent control plane, model routing, failover, and operational visibility
MCP	Connecting agents to files, databases, calendars, and app data so they can read raw records directly
Composio	Managed OAuth, per-user sessions, token refresh, triggers, and a huge app integration layer

My take: if you care about verification, OpenClaw + MCP + Composio is more interesting than another hosted chat app.

Why OpenClaw is a good fit for verification work

OpenClaw is interesting because it behaves more like infrastructure than a chat toy.

If I’m asking an agent to reconcile:

local exports
inbox history
SQLite rows
Slack messages
a spreadsheet someone emailed three weeks ago

I want something inspectable.

OpenClaw exposes commands that make that possible:

openclaw status
openclaw status --all
openclaw status --deep

openclaw health --json
openclaw health --verbose

That matters.

A verification layer should be debuggable. If the agent is going to tell me the dashboard is wrong, I want to know what it touched, what failed, and what source it trusted.
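The same idea applies inside your own agent code. As a rough sketch, independent of OpenClaw's commands, you can wrap every source fetch so the run records what it touched and what failed. The function name `fetchWithAudit` and the shape of the `sources` argument are my own assumptions.

```javascript
// Run each named source fetch and record the outcome instead of failing silently.
// `sources` maps a label to an async function that returns an array of records.
async function fetchWithAudit(sources) {
  const data = {};
  const audit = [];
  for (const [name, fetchFn] of Object.entries(sources)) {
    const startedAt = new Date().toISOString();
    try {
      const records = await fetchFn();
      data[name] = records;
      audit.push({ source: name, ok: true, recordCount: records.length, startedAt });
    } catch (error) {
      // Keep going, but make the failure visible in the audit trail.
      data[name] = [];
      audit.push({
        source: name,
        ok: false,
        error: String((error && error.message) || error),
        startedAt,
      });
    }
  }
  return { data, audit };
}
```

Attaching `audit` to every report means a "dashboard is wrong" claim always comes with the list of sources that were actually read.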

Where MCP becomes the useful part

MCP matters because it gives the agent a standard way to access real systems instead of scraping one screen and pretending that’s truth.

For example, if your agent can connect to:

Gmail
Google Calendar
PostgreSQL
SQLite
local files
Notion

then it can rebuild answers from source records.

That’s a much healthier pattern than “open dashboard, read total, repeat total.”

A minimal example might look like this conceptually:

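// Conceptual sketch: gmail, slack, postgres, and sqlite stand in for whatever
// connectors you use; normalize() and reconcile() are placeholder helpers.
const since = "2026-06-01";

// Fetch every source in parallel (top-level await requires an ES module).
const records = await Promise.all([
  gmail.getThreads({ since }),
  slack.getMessages({ channel: "support", since }),
  postgres.query("select * from tickets where created_at >= $1", [since]),
  sqlite.query("select * from sync_events where ts >= ?", [since])
]);

const normalized = normalize(records);
const result = reconcile(normalized);

console.log(result.mismatches);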

The exact APIs vary, but the pattern is the same:

fetch source records
normalize them
compute the answer
compare it to the app summary
output evidence
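The normalize and compute steps are where most of the real work hides. Here is one possible shape for them, assuming every source yields objects with `id` and `timestamp` fields and that the dashboard metric is a count of distinct IDs. Both helpers are hypothetical; their signatures are my choice, and real connectors will need their own field mapping.

```javascript
// Hypothetical helper: flatten { sourceName: records[] } into comparable rows.
function normalize(recordsBySource) {
  const normalized = [];
  for (const [source, records] of Object.entries(recordsBySource)) {
    for (const record of records) {
      normalized.push({
        source,
        // IDs often differ only by case or stray whitespace across systems.
        id: String(record.id).trim().toLowerCase(),
        // Store every timestamp as ISO 8601 UTC so sources are comparable.
        timestamp: new Date(record.timestamp).toISOString(),
      });
    }
  }
  return normalized;
}

// Hypothetical helper: recompute the count and keep the evidence per ID.
function reconcile(normalized, dashboardCount) {
  const evidenceById = new Map();
  for (const record of normalized) {
    if (!evidenceById.has(record.id)) evidenceById.set(record.id, []);
    evidenceById.get(record.id).push(record);
  }
  const recomputedCount = evidenceById.size;
  return {
    dashboardCount,
    recomputedCount,
    matches: recomputedCount === dashboardCount,
    evidence: Object.fromEntries(evidenceById),
  };
}
```

Keeping the per-ID evidence in the result is the design choice that matters: the final report can cite rows instead of just asserting a different total.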

Where Composio saves you from OAuth hell

This is the part developers underestimate until they lose a weekend to auth flows.

Composio is useful because it handles the ugly integration layer:

OAuth
per-user connections
token refresh
triggers
SDK and CLI access
lots of app integrations

That means your agent can pull from the systems teams actually use, like Gmail, Slack, Google Sheets, and Linear, without you hand-rolling auth for every connector.

Their install path is refreshingly simple:

curl -fsSL https://composio.dev/install | bash

And yes, this matters for verification. If your agent can pull raw Slack messages and compare them against CRM activity or ticket counts, you can catch the mismatch before someone forwards a wrong report.

A practical verification workflow

This is where the idea stops being abstract.

A solid reconciliation pipeline usually looks like this:

Pull source data from every system involved
Normalize IDs, timestamps, and duplicates
Ask the model to reconcile differences
Compare the model’s computed result to the dashboard value
Emit a mismatch report with links to evidence

If you’re using n8n, this is a very natural fit.

Example flow:

Node 1: fetch Gmail thread export
Node 2: fetch Slack messages
Node 3: read Google Sheets rows
Node 4: query PostgreSQL
Node 5: run reconciliation with Claude or GPT-5
Node 6: post mismatch report to Slack or email

That’s a much better use of an agent than asking it to sound clever in a sidebar.
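For the last step in that flow, the report itself can be plain deterministic code. This is a sketch of a formatter you might run in a code step before posting to Slack or email. The `result` shape (`metric`, `dashboardValue`, `recomputedValue`, `discrepancies`) and the function name `formatMismatchReport` are assumptions, not an n8n or Slack API.

```javascript
// Turn a reconciliation result into a plain-text report.
// The `result` shape is an assumption made for this sketch.
function formatMismatchReport(result) {
  const lines = [
    `Metric: ${result.metric}`,
    `Dashboard: ${result.dashboardValue}`,
    `Recomputed: ${result.recomputedValue}`,
  ];
  if (result.discrepancies.length === 0) {
    lines.push("No discrepancies found.");
  } else {
    lines.push(`Discrepancies (${result.discrepancies.length}):`);
    for (const item of result.discrepancies) {
      // Every discrepancy carries a link back to the record that proves it.
      lines.push(`- ${item.id}: ${item.reason} (evidence: ${item.evidenceUrl})`);
    }
  }
  return lines.join("\n");
}
```

Formatting the report in code, and leaving only the explanation of messy cases to the model, keeps the numbers in the message reproducible.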

Example: compare a dashboard metric to source records

Here’s a stripped-down Node.js example showing the shape of the workflow.

// Recompute "customers contacted" from outbound Gmail threads and compare it
// with both the dashboard number and the CRM's own contact records.
async function verifyContactCount({ dashboardCount, gmailThreads, crmRecords }) {
  // Customers we actually emailed, keyed by normalized address.
  const contactedEmails = new Set();

  for (const thread of gmailThreads) {
    if (thread.direction === "outbound" && thread.customerEmail) {
      contactedEmails.add(thread.customerEmail.trim().toLowerCase());
    }
  }

  // Customers the CRM believes were contacted.
  const crmTouched = new Set();

  for (const record of crmRecords) {
    if (record.customerEmail && record.lastContactedAt) {
      crmTouched.add(record.customerEmail.trim().toLowerCase());
    }
  }

  // Addresses present in one system but not the other are the evidence trail.
  const onlyInGmail = [...contactedEmails].filter(email => !crmTouched.has(email));
  const onlyInCrm = [...crmTouched].filter(email => !contactedEmails.has(email));

  return {
    dashboardCount,
    recomputedCount: contactedEmails.size,
    mismatch: dashboardCount !== contactedEmails.size,
    onlyInGmail,
    onlyInCrm
  };
}

That’s not fancy AI. It’s just disciplined verification.

The model becomes useful when the records are messy and spread across systems, and when you want a readable explanation of what mismatched and why.
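One way to hand that explanation job to a model is to send it only the computed evidence. The sketch below builds a request body in the chat completions format used by OpenAI-compatible APIs and does not call the network. The function name `buildExplanationRequest`, the prompt wording, and the input field names are my assumptions; the model name is whatever your provider expects.

```javascript
// Build a chat completions request body for an OpenAI-compatible API.
// Nothing here sends the request; the caller supplies the model name.
function buildExplanationRequest(model, verification) {
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "You explain data mismatches. Use only the evidence provided. " +
          "Label each conclusion as confirmed or inferred.",
      },
      {
        role: "user",
        // Send the computed result, not the raw inboxes, to keep the prompt small.
        content: JSON.stringify({
          dashboardCount: verification.dashboardCount,
          recomputedCount: verification.recomputedCount,
          onlyInGmail: verification.onlyInGmail,
          onlyInCrm: verification.onlyInCrm,
        }),
      },
    ],
  };
}
```

The counting stays in ordinary code; the model only narrates a result it cannot change.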

The checks I would add immediately

Reconstructing from source records is safer than trusting a dashboard.

It is not automatically correct.

If the raw data is delayed, incomplete, malformed, or duplicated, the agent can still produce a bad answer. It’ll just do it confidently.

So if I were building this for production, I’d require the agent to report:

record counts per source
missing date ranges
duplicate IDs
source freshness timestamps
confirmed vs inferred conclusions
exact evidence rows or links for every discrepancy

That last one is the big one.

If the agent says the dashboard is wrong, it should point to the exact Gmail thread, Slack permalink, SQLite row, or CSV line that proves it.

Otherwise you’ve just replaced one opaque summary with another.
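Several of those checks are cheap to compute before the model sees anything. Here is a minimal sketch of a per-source health summary, assuming each record has an `id` and a valid ISO 8601 `timestamp`. The function name `sourceHealth` and the output fields are hypothetical.

```javascript
// Summarize one source before trusting it. Assumes each record has `id` and a
// valid ISO 8601 `timestamp`; `now` is injectable so the check is testable.
function sourceHealth(name, records, now = new Date()) {
  const seen = new Set();
  const duplicateIds = new Set();
  const daysWithData = new Set();
  let newest = null;

  for (const record of records) {
    if (seen.has(record.id)) duplicateIds.add(record.id);
    seen.add(record.id);
    const ts = new Date(record.timestamp);
    daysWithData.add(ts.toISOString().slice(0, 10));
    if (newest === null || ts > newest) newest = ts;
  }

  // Walk the calendar between the first and last day seen and list the gaps.
  const sortedDays = [...daysWithData].sort();
  const missingDays = [];
  if (sortedDays.length > 0) {
    const cursor = new Date(`${sortedDays[0]}T00:00:00Z`);
    const end = new Date(`${sortedDays[sortedDays.length - 1]}T00:00:00Z`);
    while (cursor <= end) {
      const day = cursor.toISOString().slice(0, 10);
      if (!daysWithData.has(day)) missingDays.push(day);
      cursor.setUTCDate(cursor.getUTCDate() + 1);
    }
  }

  return {
    source: name,
    recordCount: records.length,
    duplicateIds: [...duplicateIds],
    missingDays,
    newestRecordAt: newest ? newest.toISOString() : null,
    hoursSinceNewest: newest ? (now - newest) / 3600000 : null,
  };
}
```

A gap in `missingDays` or a large `hoursSinceNewest` is a reason to mark the conclusion as inferred instead of confirmed.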

When this is worth doing

Not every workflow needs this.

Sometimes the dashboard is good enough.

You should build a verification layer when:

multiple systems disagree
the dashboard is known to lag
humans are manually cross-checking records already
the cost of a wrong answer is high
the workflow is repetitive enough to automate

Good candidates:

support ops
CRM hygiene
back-office agent workflows
sync verification
compliance-ish audit trails
billing and activity reconciliation

Bad candidates:

low-stakes vanity metrics
anything where “close enough” is actually fine

Model cost becomes the hidden blocker fast

There’s also a practical issue people avoid talking about.

Verification workflows are token-hungry.

If your agent is constantly pulling records, normalizing them, retrying, comparing outputs, and generating evidence-backed reports, per-token pricing gets annoying fast.
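The arithmetic is simple enough to sketch. Every input below is supplied by the caller, and the function name `estimateMonthlyTokenCost` and its parameters are my own; plug in your real token counts and your provider's actual price.

```javascript
// Rough metered-cost estimate for a recurring reconciliation job.
// All inputs are caller-supplied; none of them are real prices.
function estimateMonthlyTokenCost({
  tokensPerRun,
  runsPerDay,
  daysPerMonth,
  pricePerMillionTokens,
}) {
  const monthlyTokens = tokensPerRun * runsPerDay * daysPerMonth;
  return {
    monthlyTokens,
    monthlyCost: (monthlyTokens / 1_000_000) * pricePerMillionTokens,
  };
}
```

With placeholder inputs of 50,000 tokens per run, 96 runs a day (one every 15 minutes), 30 days, and a made-up price of 2 per million tokens, this works out to 144,000,000 tokens and a cost of 288 in whatever currency the price uses. Retries and multi-step agent loops multiply `tokensPerRun`, which is why the total climbs quickly.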

This is exactly the kind of workload where teams start self-censoring:

“don’t run it too often”
“skip full reconciliation on smaller accounts”
“only check the dashboard if someone complains”

That defeats the point.

Verification is most useful when it runs consistently, not when someone is nervous about the bill.

That’s why I think flat-rate inference is underrated for agentic ops work.

With Standard Compute, you get unlimited AI compute for a predictable monthly price, using an OpenAI-compatible API. That means you can plug it into existing SDKs, n8n flows, or custom agents without redesigning your stack around token anxiety.

For this kind of always-on reconciliation workflow, that pricing model makes more sense than metering every check like it’s a luxury feature.

Especially if your agents are running 24/7 across automations.

The bigger shift

The most underrated thing about agents is that the best use cases are often not about generation.

They’re about reconstruction.

Yes, model choice matters. GPT-5 is good at structured reasoning. Claude is often strong at careful synthesis. Other models can be fine depending on constraints.

But if the agent cannot access the real records, none of that matters much.

A boring agent with direct access to Gmail, Slack, PostgreSQL, SQLite, and local exports will beat a brilliant model trapped inside a dashboard tab.

That’s the shift.

Once you see it, you stop asking:

“Can AI summarize this screen?”

And you start asking the better question:

“What would the answer be if the agent ignored the dashboard completely and rebuilt it from evidence?”

That’s the version I trust.
