DEV Community

Tiphis

What developers get wrong about AI code agents (and how to fix it)


AI code agents are now good enough to write non‑trivial features, refactors, and migrations—but most teams adopt them like a smarter autocomplete. They paste a big prompt, hope the agent “gets it,” and then judge the whole category by the first broken PR.

That’s the wrong mental model.

An AI code agent is less like a senior engineer and more like a fast junior engineer with (a) perfect recall of public patterns, (b) imperfect understanding of your codebase, and (c) the ability to take actions very quickly. If you don’t wrap that speed in constraints, you get a speedrun toward subtle bugs.

The good news: reliability is mostly fixable with workflow design. The model matters, but your process matters more.

This matters now because agents are moving from “suggest” to “do”: they run commands, open files, edit multiple modules, and push diffs. As soon as a tool can change your repo, you need the same discipline you apply to CI/CD and production access.

Below is a practical playbook you can implement this week.


The core misunderstanding

Developers treat agent output as content (“generate code”), when they should treat it as a change proposal (“produce a diff that must satisfy constraints”).

Content workflows optimize for fluency. Change‑proposal workflows optimize for correctness.

If you only change one thing: make your agent always produce a patch that must pass tests and match an explicit contract.


7 practical fixes (checklists you can copy)

1) Force a contract before code: “Definition of Done” in 10 lines

Before the agent writes code, require it to restate:

  • Objective (1 sentence)
  • Non‑goals (2–5 bullets)
  • Public API changes (if any)
  • Failure modes to avoid (2–5 bullets)
  • Acceptance checks (tests, commands, or observable behaviors)

Why it works: most agent failures are scope drift. A contract turns “vibes” into verifiable criteria.

Checklist: If you can’t turn the request into acceptance checks, don’t ask the agent to implement yet—ask it to propose checks first.
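One way to enforce this: make the agent emit the contract as structured data, and refuse to start implementation until it is checkable. A minimal sketch (the interface and field names are illustrative, not any tool's real schema):

```typescript
// Hypothetical "Definition of Done" the agent must emit before writing code.
interface Contract {
  objective: string;          // 1 sentence
  nonGoals: string[];         // 2-5 bullets
  publicApiChanges: string[]; // empty if none
  failureModes: string[];     // 2-5 bullets
  acceptanceChecks: string[]; // test commands or observable behaviors
}

// Gate: no acceptance checks means no implementation yet --
// send the agent back to propose checks first.
function readyToImplement(c: Contract): boolean {
  return c.objective.trim().length > 0 && c.acceptanceChecks.length > 0;
}
```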

2) Always work in a sandbox branch + require a diff

Agents should never “edit main.” Make the default workflow:

  • create branch
  • make incremental commits
  • open PR (even locally)
  • show a diff summary

Why it works: humans review diffs. They don’t review 600 lines of pasted code.

Checklist: Require “List touched files + why each changed” and “Top 5 risky lines” (the agent can highlight them).

3) Constrain tools: explicit allowlist of commands

If your agent can run tools (shell, package manager, DB, kubectl), you need a policy.

Start with an allowlist:

  • ✅ read-only: ls, cat, rg, git diff, git status
  • ✅ safe build/test: npm test, go test, pytest, mvn -q test
  • ⚠️ gated: npm install, pip install, codegen, migrations
  • ❌ forbidden by default: rm -rf, kubectl delete, prod credentials, writing to global config

Why it works: many “agent mistakes” are one bad command.

Checklist: Separate analysis tools from mutation tools. Require confirmation for mutation.
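One way to encode the policy is a default-deny classifier that the agent harness consults before executing anything. A sketch (the patterns below mirror the lists above and are a starting point, not an exhaustive policy):

```typescript
// Hypothetical command policy: classify before the agent may run anything.
type Verdict = "allow" | "confirm" | "deny";

const READ_ONLY = [/^ls\b/, /^cat\b/, /^rg\b/, /^git (diff|status)\b/];
const SAFE_TEST = [/^npm test\b/, /^go test\b/, /^pytest\b/, /^mvn -q test\b/];
const GATED = [/^npm install\b/, /^pip install\b/];

function classify(cmd: string): Verdict {
  const c = cmd.trim();
  if (READ_ONLY.some((re) => re.test(c))) return "allow";
  if (SAFE_TEST.some((re) => re.test(c))) return "allow";
  if (GATED.some((re) => re.test(c))) return "confirm"; // human confirms mutations
  return "deny"; // default-deny: anything unlisted needs a policy change, not a shrug
}
```

The important property is the last line: unknown commands are denied, not allowed.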

4) Put tests in the loop: “red → green” as the agent’s job

Tell the agent it is responsible for:
1) writing/adjusting tests that encode the contract
2) making them fail (or at least showing the current failing behavior)
3) implementing code until tests pass

If the repo has weak tests, start with:

  • golden file tests
  • snapshot tests (carefully)
  • minimal integration test around the change

Why it works: tests are the only scalable truth source.

Checklist: No PR from an agent without: test command + output snippet.

5) Make the agent explain tradeoffs and alternatives

Require a short design note:

  • Option A (chosen): why
  • Option B: why rejected
  • Risks + mitigations

Why it works: agents can generate plausible code for multiple approaches; you need the reasoning surfaced.

Checklist: If the agent can’t name an alternative, it probably didn’t search the problem space.

6) Add “tripwires” for common hallucinations

Agents routinely:

  • call non-existent methods
  • use the wrong config key
  • assume a library exists
  • misuse concurrency primitives

Tripwires you can add:

  • rg validation step: “show me where this function/type exists in the repo”
  • dependency check: “confirm this package is in lockfile”
  • compile/typecheck step (even for dynamic langs: mypy/pyright, tsc)

Why it works: you replace guessing with verification.

Checklist: Every new symbol introduced must be proven to exist (or created) with a repo search.
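The symbol tripwire can be a few lines. A sketch of the check, done in memory here rather than shelling out to rg (the file names, contents, and definition regex are illustrative assumptions):

```typescript
// Hypothetical tripwire: before accepting a diff that calls `symbol`,
// prove the symbol is defined somewhere in the repo (or in the diff itself).
// Assumes `symbol` is a plain identifier, not an arbitrary regex.
function symbolDefined(symbol: string, files: Record<string, string>): boolean {
  const def = new RegExp(`\\b(function|const|class|type|interface)\\s+${symbol}\\b`);
  return Object.values(files).some((src) => def.test(src));
}
```

In practice you would run this over `rg` output instead of an in-memory map, but the contract is the same: no match, no merge.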

7) Instrument for audit: log prompts, diffs, tool calls

If you want to use agents seriously in a team:

  • record prompt + model + temperature
  • record tool calls and outputs
  • record final diff and test results

Why it works: when something breaks, you need to reproduce the workflow, not just the code.

Checklist: Store an “agent run transcript” next to the PR (as an artifact or comment).
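A transcript doesn’t need a platform; a JSON artifact per run is enough to start. A sketch of one possible shape (field names are assumptions, not any vendor’s format):

```typescript
// Hypothetical agent-run transcript, committed as agent-run.json
// or posted as a PR comment.
interface ToolCall {
  command: string;
  output: string;
}

interface AgentRun {
  prompt: string;
  model: string;
  temperature: number;
  toolCalls: ToolCall[];
  finalDiff: string;
  testsPassed: boolean;
}

function toArtifact(run: AgentRun): string {
  return JSON.stringify(run, null, 2);
}
```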


A concrete example workflow: “Agent-assisted refactor with guardrails”

Scenario: You want to refactor a Node/TypeScript service to replace a hand-rolled retry loop with a shared retry() utility, without changing behavior.

Tools

  • Git
  • ripgrep (rg)
  • TypeScript compiler (tsc -p .)
  • Test runner (pnpm test or npm test)
  • Optional: a code agent that can edit files + run allowlisted commands

Steps

1) Write the contract (human + agent)
Prompt the agent:

  • “Restate the contract, list non-goals, and propose 3 acceptance tests.”

Example acceptance checks:

  • unit: retry stops after N attempts
  • integration: existing API call still returns same error mapping
  • static: tsc passes

2) Create a sandbox branch

  • git checkout -b agent/refactor-retry

3) Locate all call sites (verification step)

  • rg "while \(attempt" src/ (or whatever pattern matches)
  • Agent must list the files and classify by risk (hot paths, payments, auth).

4) Add/lock tests first

  • If tests exist, add cases around current behavior.
  • If tests don’t exist, add a minimal harness using dependency injection or nock.

Pitfall: agents love rewriting tests to match their implementation. Guardrail: keep the tests pinned to the old behavior; the agent may change the code, not the expectations.

5) Implement utility + incremental adoption

  • Create src/utils/retry.ts
  • Migrate one low-risk call site
  • Run tsc + tests
  • Commit: “Introduce retry utility”
  • Repeat call site by call site

Pitfall: the agent may “simplify” error handling. Guardrail: require a diff review focused on catch blocks and error mapping.
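For orientation, here is a minimal sketch of what src/utils/retry.ts might look like. The signature, attempt semantics, and backoff policy are assumptions; they must be matched against the hand-rolled loop being replaced, not the other way around:

```typescript
// Minimal retry utility sketch. Note: rethrows the ORIGINAL error after the
// last attempt, so existing error mapping at call sites is preserved.
async function retry<T>(
  fn: () => Promise<T>,
  attempts: number,
  baseDelayMs = 0, // 0 = no backoff; real call sites likely want exponential
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (baseDelayMs > 0 && i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

The “rethrow the original error” comment is exactly the kind of invariant the diff review in step 6 should verify at every migrated call site.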

6) Diff review checklist (human)

  • Any change in exception types or messages?
  • Any change in timeouts/backoff?
  • Any new dependency?
  • Any concurrency behavior change?

7) CI gate + rollback plan

  • Require the full CI suite to pass on the agent branch before merge.
  • Ensure the PR can be reverted cleanly (small commits help).

This workflow is intentionally boring. That’s the point: boring is reliable.


Costs / latency / quality tradeoffs (what to optimize for)

You can push an agent to be faster, cheaper, or higher quality. You rarely get all three.

  • Lowest latency: single-shot prompts, minimal tool use. Fast, but highest risk of hallucinations and missed repo constraints.
  • Lowest cost: smaller models, fewer tool calls, fewer iterations. Fine for boilerplate and documentation; risky for refactors.
  • Highest quality: multi-step agent loop with repo search + compile + tests + diff review. More tokens, more tool calls, more wall time—but the first PR you don’t have to revert pays for it.

Rule of thumb:

  • If the change touches auth, payments, data migrations, concurrency, or security: pay the latency/cost for verification loops.
  • If it’s pure UI copy or isolated utility code: optimize for speed.
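If your harness knows which paths a diff touches, the rule of thumb can be automated as a cheap gate. A sketch (the keyword list is an assumption; tune it to your repo’s actual layout):

```typescript
// Hypothetical risk gate: decide from touched paths whether to pay
// for the full verification loop (repo search + typecheck + tests).
const HIGH_RISK = /auth|payment|migration|concurren|security|crypto/i;

function needsVerificationLoop(touchedPaths: string[]): boolean {
  return touchedPaths.some((p) => HIGH_RISK.test(p));
}
```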

A practical budget pattern:

  • “Scout pass” (cheap model) to locate files and propose contract/tests
  • “Build pass” (strong model) to implement with tool loop
  • “Reviewer pass” (strong model or human) to do diff-based risk review

Closing: treat agents like CI for code changes

AI agents aren’t magic; they’re automation. And automation is only as safe as its guardrails.

If you adopt the seven fixes above—contracts, diffs, tool allowlists, tests-in-loop, alternatives, tripwires, and audit logs—you’ll stop arguing about whether the model is “good enough” and start shipping changes with predictable reliability.


Tip jar

If this saved you time (or prevented a rollback), you can tip:

Wallet: 0xAa9ACeE80691997CEC41a7F4cd371963b8EAC0C4
