The first time I gave an AI agent real autonomy on a production codebase, it confidently refactored a utility method that happened to share a name with a method in a Feign client interface six modules away. The code compiled cleanly. My unit tests passed. Staging broke in a way that took two hours to trace because the JSON serialization behavior had subtly changed.
That was roughly hour 200. I've now crossed 1,000 hours of daily use across projects — Spring Boot microservices, a Flutter mobile app, Python data pipelines, some Go tooling. The workflow I use today is unrecognizable compared to what I started with, and most of what I thought I knew in those first few months was wrong.
The gap between "AI assistant that occasionally saves time" and "force multiplier that ships reliably" isn't about which model you use or which IDE plugin you install. It's almost entirely about how you structure the work before you hand it off.
I'm going to describe what actually works. Not the ideal case — the real case, including where it fails.
The task scoping problem nobody talks about
The most common mistake I see from developers new to agents is giving them goals instead of tasks. "Add authentication to this service" is a goal. An agent handed a goal will make dozens of implicit decisions: which library to use, where to put the filter, how to handle token expiry, whether to add a config property or hardcode something. Each decision is individually reasonable. Collectively, they often produce something that doesn't fit your codebase at all.
The mental model shift that helped me most: treat the agent like an extremely fast junior developer who has read your entire codebase but has no knowledge of your team's unwritten conventions. They'll do exactly what you said, quickly, and be confused why you're upset.
A well-scoped task has a specific file or class to modify, the exact behavior that needs to change (not the outcome you're hoping for), explicit constraints ("don't change the method signature", "stay within this module", "don't add new dependencies"), and a clear definition of done you can verify yourself in under two minutes.
Instead of "add rate limiting to the user endpoints", I write: "Add a rate limiting filter to UserController.java. Use Bucket4j, which is already in the pom. Rate limit to 100 requests per minute per IP using the X-Forwarded-For header. Add a test in UserControllerTest that verifies a 429 is returned on the 101st request from the same IP within one minute. Don't touch any other controllers." That's a task. The first version was a wish.
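To make the "definition of done" in that spec concrete, here is a minimal, dependency-free sketch of the behavior being asked for. It is not Bucket4j and not the filter itself: it swaps in a simple fixed-window counter so the example runs on its own, and the class and method names (PerIpRateLimiter, clientIp, tryAcquire) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a fixed-window counter per IP.
// The task in the text would use Bucket4j (a token bucket) instead.
final class PerIpRateLimiter {
    private final int limit;
    private final long windowMillis;
    // ip -> {window start in millis, requests counted in that window}
    private final Map<String, long[]> windows = new HashMap<>();

    PerIpRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // The first X-Forwarded-For entry is the original client; fall back to the socket address.
    static String clientIp(String xForwardedFor, String remoteAddr) {
        if (xForwardedFor == null) {
            return remoteAddr;
        }
        int comma = xForwardedFor.indexOf(',');
        String first = (comma < 0 ? xForwardedFor : xForwardedFor.substring(0, comma)).trim();
        return first.isEmpty() ? remoteAddr : first;
    }

    // Returns true if the request may proceed, false if the filter should answer 429.
    synchronized boolean tryAcquire(String ip, long nowMillis) {
        long[] window = windows.computeIfAbsent(ip, key -> new long[] {nowMillis, 0});
        if (nowMillis - window[0] >= windowMillis) {
            window[0] = nowMillis; // start a fresh window
            window[1] = 0;
        }
        if (window[1] >= limit) {
            return false;
        }
        window[1]++;
        return true;
    }

    public static void main(String[] args) {
        PerIpRateLimiter limiter = new PerIpRateLimiter(100, 60_000);
        String ip = clientIp("203.0.113.7, 10.0.0.1", "10.0.0.1");
        long now = System.currentTimeMillis();
        int allowed = 0;
        for (int i = 0; i < 101; i++) {
            if (limiter.tryAcquire(ip, now)) {
                allowed++;
            }
        }
        System.out.println(allowed + " of 101 requests allowed for " + ip);
    }
}
```

The timestamp is passed in rather than read inside tryAcquire, so the 101st-request check can be verified without sleeping for a minute. That is the kind of two-minute verification the spec is aiming for.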
Loading context is not the same as prompting
Early on I treated context like a formality — paste in the file, ask the question. What I've learned is that the context you load shapes the entire response, not just the part you're asking about.
For anything non-trivial, I now explicitly load the file being modified, its direct dependencies (the interfaces it implements, the classes it calls), the relevant test file, and any configuration that affects behavior — the relevant application.yml sections, env variables, that kind of thing.
I also state explicitly what the agent should not need to look at: "Ignore the other controllers. The auth logic is handled upstream in the filter chain, so you don't need to worry about it." This sounds redundant, but it keeps the agent from exploring in directions that add noise to the output.
When working in a large Spring Boot monolith, I'll often start a session by describing the module structure explicitly: "This project has five modules. We're only working in user-service. The common module has shared DTOs — you can read it but don't modify it." A few sentences of orientation saves many paragraphs of correction later.
Always plan before you code
The pattern that changed my output quality the most: never go straight to code on anything that touches more than one file.
With Claude Code I use /plan or just ask explicitly for a plan before any implementation. Not because I don't trust the agent to code — but because catching a wrong assumption at the plan stage costs ten seconds. Catching it after the agent has modified seven files costs twenty minutes of untangling and a lot of git checkout.
A plan review also surfaces things I forgot to mention. If the agent's plan includes "add a new UserRepository method", I realize I forgot to say we use a custom JPQL query and we don't add raw methods to the repository interface. That correction takes one sentence before coding. After coding, it's a rewrite.
For tasks that span more than one logical step, I'll ask for the plan broken into phases: "Phase 1: create the new DTO. Phase 2: update the service. Phase 3: update the controller. Phase 4: update the tests." Then I review each phase before proceeding. This is slower than letting it run, but "slower" means an extra three minutes, and it eliminates the whole class of errors where step 4 assumes something about step 2 that's already wrong.
What to never delegate
This is the part most productivity takes leave out.
Database migrations. I write every Flyway migration script myself. An agent will generate syntactically correct SQL that does the wrong thing to your data, and you often won't catch it until you run it. The cost of a wrong migration is too high.
Security logic. JWT validation, permission checks, role hierarchies. I'll let the agent scaffold the structure, but I write the actual predicate logic myself. It's not that agents are bad at it — it's that I need to understand every line of security-sensitive code personally, and "the agent wrote it and I reviewed it" isn't the same as understanding it.
Anything touching shared state in a concurrent context. Thread pool sizing, cache invalidation, queue consumer configuration. I've watched agents write perfectly reasonable-looking code in this area that had subtle race conditions surfacing only under load. Spring's @Async behavior has enough gotchas around exception handling and thread context propagation that I don't trust generated code here without very careful review.
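Two of those gotchas can be reproduced with a plain ExecutorService, no Spring required. The sketch below is an analogy rather than @Async itself: a ThreadLocal stands in for request-scoped context such as a security context, and the class and method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class AsyncGotchas {
    // Stand-in for request-scoped context such as a security context or logging MDC.
    static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    // Gotcha 1: a value set on the calling thread is not visible on the worker thread.
    static String userSeenByWorker() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CURRENT_USER.set("alice");
        try {
            Callable<String> readContext = CURRENT_USER::get;
            return pool.submit(readContext).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            CURRENT_USER.remove();
            pool.shutdown();
        }
    }

    // Gotcha 2: a failure on the worker thread reaches the caller only if it inspects the Future.
    static boolean failureOnlyVisibleThroughFuture() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Void> failing = () -> { throw new IllegalStateException("boom"); };
            Future<Void> result = pool.submit(failing); // returns normally, nothing is thrown here
            try {
                result.get();
                return false;
            } catch (ExecutionException e) {
                return e.getCause() instanceof IllegalStateException;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("user seen by worker: " + userSeenByWorker());
        System.out.println("failure hidden until get(): " + failureOnlyVisibleThroughFuture());
    }
}
```

Generated code that fires off a task and never looks at the returned Future, or that reads thread-bound context inside the task, compiles and passes a happy-path test. That is why this category gets a line-by-line read from me.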
API contracts published to other teams. If I'm changing a REST endpoint or a Kafka message schema that another service consumes, I write the change myself. Contract changes need human intent.
The actual workflow
Here's what a typical feature task looks like now:
1. Write the task spec (bounded, with explicit constraints)
2. Load relevant context explicitly
3. Ask for a plan — review it, correct it
4. Execute phase by phase, reviewing output between phases
5. Run the tests the agent wrote, then run the broader test suite
6. Read the diff, not the agent's summary
7. Commit with a message I write myself
Step 6 is worth repeating: read the actual diff, not the agent's description of what it did. Agents are optimistic narrators. The diff is the ground truth.
The phase execution loop is where most of the real work happens. On a task touching four files, I'll typically have one or two corrections mid-way. That's normal. The correction looks like: "The service method is correct but don't call userRepository.save() directly — use UserService.update() which already handles audit logging. Revise phase 3."
Patterns that work in Java/Spring Boot land
Test first, always. When I ask an agent to add a feature, I almost always ask it to write the test first. Not for TDD philosophy — because a test forces the agent to think about the interface before the implementation. The test spec is a contract. I review it, approve it, then ask for the implementation.
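// Static imports this snippet relies on:
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

// Ask for this first:
@Test
void shouldReturnUnauthorizedWhenTokenExpired() throws Exception { // mockMvc.perform declares a checked Exception
    // mockMvc, tokenGenerator, and userId are fields of the surrounding test class
    String expiredToken = tokenGenerator.generateExpiredToken(userId);
    mockMvc.perform(get("/api/users/me")
                    .header("Authorization", "Bearer " + expiredToken))
            .andExpect(status().isUnauthorized())
            .andExpect(jsonPath("$.code").value("TOKEN_EXPIRED"));
}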
If the agent can write that test clearly, it understands the requirement. If it hedges or makes assumptions in the test itself, I go back and clarify before going further.
Name your constraints explicitly in the prompt. Spring's ecosystem has a lot of ways to do the same thing. "Add caching" could mean @Cacheable, Redis directly, Caffeine, a manual ConcurrentHashMap. I name the technology: "Use @Cacheable with our existing Redis CacheManager bean. Cache name is user-profiles. TTL is already configured in CacheConfig.java."
Ask for the unhappy path. Default agent output handles the happy path well and glosses over error cases. I ask explicitly: "Also handle the case where the external payment service returns a 503. Retry once after 500ms using @Retryable, then throw a PaymentServiceUnavailableException that the controller maps to a 502."
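Spelled out in plain Java, the behavior that prompt asks for looks roughly like the sketch below. It is a hand-rolled stand-in, not the @Retryable wiring: the helper name callWithOneRetry is hypothetical, and for brevity it retries on any RuntimeException, where a real implementation would retry only on the 503 case.

```java
import java.util.function.Supplier;

// Exception named in the prompt; the controller would map it to a 502.
final class PaymentServiceUnavailableException extends RuntimeException {
    PaymentServiceUnavailableException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class RetryOnce {
    // Calls once, retries once after delayMillis, then gives up with a domain exception.
    static <T> T callWithOneRetry(Supplier<T> call, long delayMillis) {
        try {
            return call.get();
        } catch (RuntimeException firstFailure) {
            pause(delayMillis);
            try {
                return call.get();
            } catch (RuntimeException secondFailure) {
                throw new PaymentServiceUnavailableException(
                        "payment service still failing after one retry", secondFailure);
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = callWithOneRetry(() -> {
            if (attempts[0]++ == 0) {
                throw new IllegalStateException("simulated 503");
            }
            return "charged";
        }, 500);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Having the failure path written down this explicitly, whether as code or as a test, gives you something to check the agent's output against: exactly two attempts, one delay, one specific exception type.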
When it breaks — and what to do
Agents go off-rails. It happens less at hour 1,000 than it did at hour 200, but it still happens. The failure modes I see most:
Scope creep. The agent "fixes" something nearby that it noticed while working. The fix is usually not wrong, but it's unexpected and untested. My defense: explicit "do not change anything outside of [specific files]" language in the task, plus reading the diff carefully.
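Part of that diff check can be made mechanical. As a rough sketch, assuming you capture the output of git diff --name-only as a string, the helper below lists every changed path that was not on the task's allowlist. The class name and the example paths are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class DiffScopeCheck {
    // nameOnlyOutput: one changed path per line, as printed by `git diff --name-only`.
    // Returns the changed paths that are not in allowedPaths, in their original order.
    static List<String> outOfScope(String nameOnlyOutput, Set<String> allowedPaths) {
        return nameOnlyOutput.lines()
                .map(String::trim)
                .filter(path -> !path.isEmpty())
                .filter(path -> !allowedPaths.contains(path))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String changed = "src/main/java/com/example/UserController.java\n"
                + "src/main/java/com/example/OrderController.java\n";
        Set<String> allowed = Set.of(
                "src/main/java/com/example/UserController.java",
                "src/test/java/com/example/UserControllerTest.java");
        System.out.println("Out of scope: " + outOfScope(changed, allowed));
    }
}
```

This only tells you which files to look at first. It does not replace reading the diff, because an in-scope file can still contain an out-of-scope change.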
Hallucinated APIs. Especially in less-common libraries, agents will confidently use methods that don't exist. In Spring, this tends to happen with newer APIs or module-specific features. Compiling and running the code is the only reliable check; code review alone sometimes misses it.
The test that tests nothing. An agent writes a test that passes trivially because it's testing a mock returning a mock. I check by asking: "What production behavior, if I deleted it, would make this test fail?" If the answer is "nothing," the test is useless.
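The difference is easier to see side by side. In this minimal sketch the checks are plain boolean methods rather than JUnit tests, so it runs without dependencies. DiscountService and PriceRepository are hypothetical names invented for the illustration.

```java
final class VacuousTestDemo {
    interface PriceRepository {
        int priceInCents(String sku);
    }

    // Hypothetical production code under test.
    static final class DiscountService {
        private final PriceRepository repository;

        DiscountService(PriceRepository repository) {
            this.repository = repository;
        }

        int discountedPriceInCents(String sku, int percentOff) {
            int price = repository.priceInCents(sku);
            return price - (price * percentOff / 100);
        }
    }

    // Vacuous: only proves the stub returns what it was told to return.
    // Deleting DiscountService entirely would not make this fail.
    static boolean vacuousCheck() {
        PriceRepository stub = sku -> 1000;
        return stub.priceInCents("A1") == 1000;
    }

    // Meaningful: fails if the discount arithmetic in DiscountService is removed or broken.
    static boolean meaningfulCheck() {
        PriceRepository stub = sku -> 1000;
        return new DiscountService(stub).discountedPriceInCents("A1", 25) == 750;
    }

    public static void main(String[] args) {
        System.out.println("vacuous check passes: " + vacuousCheck());
        System.out.println("meaningful check passes: " + meaningfulCheck());
    }
}
```

Both checks pass, which is exactly the problem: a green result says nothing about whether the test exercises production code. Only the second one would turn red if the service logic regressed.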
When a session goes badly wrong — multiple phases deep into a mess — I don't try to patch it. I discard, go back to the last clean commit, and restart with a more constrained task spec. Fighting a bad trajectory is slower than resetting. This took me too long to learn.
The honest productivity picture
After 1,000 hours, my throughput on certain task types is genuinely higher. Boilerplate-heavy work (DTOs, controller scaffolding, empty Flyway migration stubs, test setup code) goes roughly four times faster. Complex logic involving domain rules, concurrency, or security decisions goes maybe 20% faster because the agent handles the typing while I handle the thinking.
There's also a category where it's slower: anything where I spend more time specifying and reviewing than I'd spend just coding. For a ten-line method in a well-understood domain, writing a good prompt takes longer than writing the method. So I just write the method.
The productivity gains are real. They compound only when you stop treating the agent as a magic box and start treating it like a capable collaborator who needs clear direction, bounded scope, and explicit verification at each step.
What's your experience with task scoping? I'm curious whether the "goals vs tasks" distinction resonates with other teams, or whether there's a completely different framing that works better in your context.
Originally published on Medium.
