daac-solo
Trust, Philosophy, Context, Harness — How My AI Usage Evolved

What I learned building 6 autonomous agents, a knowledge engine, and a multi-model workflow.


I used to treat AI as a chat assistant. Today I have six autonomous agents that monitor my systems every morning before I wake up.

Looking back, the journey had four distinct phases — each building on the last. This isn't a universal framework; it's just how things evolved for me. But I think the progression is natural enough that it might resonate.

Everything below is something I actually built and use daily.


Phase 1: Trust — "It Starts with Believing"

The biggest barrier to using AI isn't technical. It's psychological.

Most people try AI once, get a mediocre answer, and conclude "AI isn't useful for my work." They're using it like a search engine — ask a question, get an answer, done.

The breakthrough is a mental shift: treat AI as a collaborator, not an oracle. It has unlimited patience, broad knowledge, and tireless iteration. You have judgment, domain expertise, and taste. Together you're stronger than either alone.

What this looks like in practice

I don't say "write me a SQL query for conversion rate." I say:

"Our conversion metric spiked 2pp in the treatment arm. I need to understand — is that from more people entering the funnel, or the same people converting more often? Let's figure this out."

The AI asks clarifying questions. Proposes two hypotheses. We iterate on diagnostic queries together. Three rounds later, we find a root cause I wouldn't have found alone.

That's not "using a tool." That's collaborating.

The takeaway for beginners

Pick a real problem you're stuck on. Not a toy task — something where you'll notice if the answer is wrong. The trust builds when you see it actually help with something hard. Nobody trusts a stranger by reading their resume. You trust them by working through a tough problem together.


Phase 2: Prompt Engineering — "It's All Philosophy"

Here's the dirty secret about prompt engineering: the people who are good at it aren't using magic words. They're thinking clearly before they ask.

The prompt engineering community has produced thousands of templates — "chain of thought," "few-shot," "role-playing." These work, but they don't scale. When the situation changes, the template doesn't adapt.

What does adapt? Thinking frameworks.

Two philosophical tools

1. Socratic questioning — clarify before you ask

Before writing any prompt, answer these:

  • "What problem am I actually solving?" (not "what prompt should I write?")
  • "What would have to be true for this approach to work?"
  • "What's the strongest counterargument?"

This sounds obvious. But most bad prompts come from unclear thinking, not bad phrasing.

2. First principles + Occam's razor — solve the real problem simply

  • Start from the goal, not from convention ("how was this done before?" is usually the wrong question)
  • Prefer the simplest solution that fully solves the problem
  • Don't add complexity for hypothetical future needs

How I encoded this

I wrote these into my AI assistant's system instructions — a file called CLAUDE.md that loads on every session:

## Thinking Defaults
- First principles: ask "what problem are we solving?" before
  "how was this done before?"
- Occam's razor: prefer the simplest solution that fully solves
  the problem
- Socratic questioning: probe deeper before accepting the frame.
  Ask "why does this matter?", "what would have to be true?"

These aren't prompts. They're thinking instructions — they apply to every task, every domain. I wrote them once and they've shaped hundreds of interactions.

The takeaway

Learn 3 questions instead of 300 prompts. Philosophy scales; templates don't.


Phase 3: Context Engineering — "Make the AI Know What You Know"

A single prompt is a one-shot. Context engineering is about systematically giving AI the right information at the right time — across sessions, across tasks, across models.

This is where the real leverage is. Karpathy was right — the new skill in AI is context engineering, not prompt engineering.

I built three pillars:

Pillar 1: Progressive Disclosure

Don't dump everything into the prompt. Layer it:

Always loaded:  CLAUDE.md (thinking rules, voice, hard constraints)
                    |
Task-relevant:  8 domain skill files (experiment analysis,
                SQL gotchas, stakeholder communication...)
                    |
Just-in-time:   auto-search hook — surfaces relevant prior
                findings on every message

The auto-search hook is the key innovation. I built a semantic search engine over 350+ indexed findings from past work. It runs automatically on every message I send — searches the index, surfaces the top 3 relevant results as additional context. ~1 second latency. I never ask for it. It just works.

# The hook script (simplified; extract_from_stdin and has_relevant_results
# stand in for a few lines of JSON parsing and relevance filtering)
PROMPT=$(extract_from_stdin)               # the message I just sent
RESULTS=$(ke search "$PROMPT" -n 3)        # semantic search, top 3 findings
if has_relevant_results "$RESULTS"; then   # drop low-relevance matches
  echo "{\"additionalContext\": $RESULTS}" # inject into the session context
fi

The insight: AI doesn't need everything. It needs the right things at the right time.

Pillar 2: Memory Management

Knowledge flows through three durability levels:

Raw observation  -->  Indexed finding  -->  Skill file entry
  (one session)      (searchable)          (permanent)

After every substantive task — debugging sessions, data analysis, root cause investigations — the AI auto-captures findings:

ke add "sql-gotchas" \
  "LATERAL VIEW EXPLODE can't be followed by LEFT JOIN in same FROM" \
  -t "databricks,sql,parse-error" \
  -s "debugging session 2026-03-15"

Findings that prove useful across multiple sessions get promoted to permanent skill files. The system evolves organically through real work, not batch maintenance.
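The promotion step can be as simple as counting how often a finding resurfaces. A hypothetical sketch, not the real system: assume a log with one line per session in which a finding was surfaced, and a threshold of three sessions.

```shell
# Hypothetical promotion check. Input: one finding-id per line, one line per
# session in which that finding was surfaced. Findings that resurface in 3+
# sessions are candidates for a permanent skill file.
promotion_candidates() {
  sort | uniq -c | awk '$1 >= 3 { print $2 }'
}

# Example: sql-lateral-join was surfaced in 3 sessions, slack-rate-limit in 1.
printf 'sql-lateral-join\nslack-rate-limit\nsql-lateral-join\nsql-lateral-join\n' \
  | promotion_candidates   # -> sql-lateral-join
```

The threshold is the interesting knob: too low and skill files fill with noise, too high and useful findings stay buried in the index.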

Pillar 3: Multi-Model Workflows

Different models have different strengths. I built a pipeline that uses three:

/tri-ai-workflow:
  Claude plans  -->  Codex codes  -->  Gemini reviews
  (architecture)    (implementation)   (adversarial review)

And a completion gate that forces the right quality check based on task type:

/done triggers:
  SQL task     --> 9-point data sanity checklist
  Experiment   --> 7-point analysis checklist
  ETL pipeline --> 8-point review checklist
  Code         --> Gemini review score >= 90

No task gets declared "done" without passing its gate.
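In shell, the dispatch boils down to a case statement. This is a simplified sketch (the real gate also runs the checklist; this only selects it, and the function name is illustrative):

```shell
# Pick the completion gate for a task type (simplified sketch).
pick_gate() {
  case "$1" in
    sql)        echo "9-point data sanity checklist" ;;
    experiment) echo "7-point analysis checklist" ;;
    etl)        echo "8-point review checklist" ;;
    code)       echo "Gemini review score >= 90" ;;
    *)          echo "no gate defined for: $1" >&2; return 1 ;;
  esac
}

pick_gate sql   # -> 9-point data sanity checklist
```

Unknown task types fail loudly rather than silently skipping the gate, which matters more than it sounds: a quality check that can be bypassed by a typo isn't a gate.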

The takeaway

Context engineering is workflow design. It's your expertise, encoded so AI can participate at your level.


Phase 4: Harness Engineering — "Build Stable Systems, Not Clever Prompts"

This is where I am now. AI moves from "tool I use" to "system that runs." The hard problems are no longer prompting — they're reliability, observability, and failure handling.

I call this "harness engineering" — the engineering discipline of building the scaffolding that makes AI agents work consistently in production.

What I built in 3 days

Six autonomous agents that run every morning:

| Agent | What it monitors | Schedule |
| --- | --- | --- |
| Data Freshness | upstream table staleness (45+ tables) | 6:30 AM |
| Business Metrics | daily revenue/conversion vs 7d/28d baselines | 7:00 AM |
| Growth Tracker | daily migration metrics | 7:15 AM |
| Experiment Health | enrollment drift, arm balance, platform coverage | 7:30 AM |
| Portfolio Monitor | weekly revenue/impression share WoW + incident detection | 6:00 AM |
| System Monitor | ranking input anomalies + impression tracking across 6 surfaces | 9:00 AM |

Each agent reads from a data warehouse, reasons about what it finds, and posts a summary to Slack. If something is anomalous, it flags it with context explaining why.

This sounds fancy, but the architecture is boring on purpose.

The architecture is deliberately boring

prompt file + cron + `claude -p` --> Slack

That's it. No LangGraph. No multi-agent orchestration framework. No agent-to-agent communication. Each agent is independent.

I evaluated two frameworks — one with 32 specialized agent roles, another with full LangGraph orchestration — and rejected both. They solve problems I don't have yet. The simplest architecture that works is the right one (Occam's razor, all the way down).
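Concretely, "prompt file + cron + `claude -p`" means each agent is one crontab line. The paths, prompt filenames, and the `notify-slack.sh` helper below are placeholders, not the actual setup:

```shell
# Illustrative crontab entries. Each agent: run a prompt file headlessly
# with `claude -p` (print mode), pipe the output to a Slack notifier.

# 6:30 AM: data freshness runs first so downstream agents know about gaps
30 6 * * * claude -p "$(cat ~/agents/data_freshness.md)" | ~/agents/notify-slack.sh
# 7:00 AM: business metrics vs 7d/28d baselines
0 7 * * * claude -p "$(cat ~/agents/business_metrics.md)" | ~/agents/notify-slack.sh
```

Adding an agent is adding a prompt file and a line; deleting one is deleting them. There is no orchestrator to keep in sync.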

Five lessons from building this

1. Environment matters more than prompts.

All agents failed on day 1. Not because the prompts were wrong — because the cron scheduler's config was missing environment variables for cloud authentication. The prompt was perfect; the harness was broken.

This is the defining insight of harness engineering: most failures aren't prompt failures. They're infrastructure failures.

2. Domain knowledge prevents false positives.

My system monitor checks for anomalies in ranking inputs. But some values are intentionally static — they're hardcoded overrides, not anomalies. Without encoding these known values in the prompt, every run produced false alerts.

The prompt needs to know what "normal" looks like. That's domain knowledge, not prompt engineering.

3. Output format = platform constraints.

Markdown tables don't render in Slack API messages. I had to switch to monospace code blocks. The AI generates beautiful markdown — but the harness decides what actually works.
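A sketch of the workaround, assuming a Slack-style `{"text": ...}` payload (the table contents here are made up): wrap the table in a triple-backtick block so Slack renders it in monospace.

```shell
# Slack ignores Markdown table syntax, so send tabular output inside a
# triple-backtick code block instead. jq builds the JSON payload safely.
table='metric    today   7d_avg
revenue   102.3   98.7'
payload=$(jq -n --arg t "$table" '{text: ("```\n" + $t + "\n```")}')
echo "$payload"
```

Column alignment then has to be done with plain spaces on the producing side, which is exactly the kind of dull constraint the harness, not the model, has to own.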

4. Ordering creates reliability without coupling.

The data freshness agent runs at 6:30, before all other agents. If data is stale, downstream agents know to expect gaps. This is dependency ordering without inter-agent communication — simple, robust, no shared state.

5. Start simple, iterate fast.

Version 1 of the system monitor showed only anomalies. V2 added a full feature summary table. V3 added impression tracking across all 6 surfaces with per-surface breakdown. Each version shipped the same day, informed by the previous version's gaps.

Open questions

  • Cross-run memory: agents don't remember yesterday's findings. How do you give them session persistence without overcomplicating the architecture?
  • Failure alerting: if an agent errors, it just... doesn't post. You notice the absence. There's no alerting on the alerter.
  • Inter-agent signals: when one agent detects an anomaly, can another agent use that signal as context? Without building a message bus?

The takeaway

Harness engineering is about making AI reliable, not clever. Environment setup, scheduling, false positive suppression, output formatting — that's where the real work is.


Looking Back

Harness Engineering  -- stable autonomous systems
Context Engineering  -- right info, right time, across sessions
Prompt Engineering   -- philosophical clarity before asking
Trust                -- believing you can solve problems together

Each phase built on the previous. I couldn't have invested in context engineering without trusting AI enough to build a workflow around it. And harness engineering wouldn't work without the context layer making outputs reliable.

This was my path — yours might look different. But I think the layers stack naturally. You don't graduate from one to the next; you keep using all of them.


I'm currently exploring the open questions in harness engineering — if you're building autonomous agents and have solved cross-run memory or failure recovery, I'd love to hear your approach.
