Last week, I posed a hypothesis: AI coding agents might enable experienced product managers and designers to operate across the traditional business/product/engineering divide. After generating 30,000 lines of code, I'm convinced the opportunity is real—but only for those willing to master a completely new kind of supervision.
To test this practically, I started building something I've wanted for seven years: a production-grade design system. Since Style Dictionary launched, I've been fascinated by systematic design token management, but the economics never worked. Building a proper design system takes 3-5 developers working 6-12 months—easily $300,000+ in startup resources for comprehensive testing, documentation, and Storybook examples.
My hypothesis: I can produce something sufficiently comparable in two months using agentic code generation.
After just a few days of focused work, the early results show genuine promise. The journey also revealed exactly why this approach demands serious technical knowledge to succeed.
The Logger Saga: Learning to Manage Lennie
Nothing illustrates the George-and-Lennie dynamic better than my week-long battle to implement enterprise-grade logging. I know this sounds absurdly overengineered for a design system—`console.log()` would probably work fine—but part of my experiment involves pushing AI supervision to production-quality limits.
Phase 1: The Enthusiastic Amateur
When a method naturally required logging, Lennie dutifully added a few `console.log()` calls. I asked a simple question: "Should we implement a shared Logger that can be reused across the codebase?"
Lennie correctly identified this as good architecture, then immediately started building a custom logger from scratch. I let it run for several minutes, curious to see what it would produce. The result wasn't terrible—exactly what you'd expect from someone who possesses tremendous strength but has never heard of third-party packages.
Lennie sees a problem and applies raw strength, even when finesse would work better. Rather than researching existing solutions, it enthusiastically reinvents wheels with the same eager intensity it brings to every task.
Phase 2: George Steps In
I stopped the custom implementation and redirected: "Research best practices we could apply to decompose the logger and make it robust enough for a production enterprise environment."
Once I reminded Lennie to actually research first, it recommended Pino as the foundation—exactly matching my own investigation. But I had to explicitly tell it to stop and think. Left to its own devices, Lennie would have continued building a mediocre custom solution indefinitely, convinced it was helping.
"Ma'am, specificity is the soul of all good communication." — Middleman
This redirect became my primary management technique throughout the project. Lennie possesses tremendous implementation power, but George must provide constant course correction to produce professional-quality results.
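For the curious, here's roughly the shape a shared Pino-based logger can take. This is a minimal sketch, assuming Node 18+ and a `LOG_LEVEL` environment variable; the module and option names are illustrative rather than the actual Terroir code.

```typescript
// logger.ts: a minimal shared logger sketch (illustrative, not the Terroir implementation)
import pino, { type Logger } from 'pino';

// One root logger for the whole package; the level comes from the environment
// so tests and CI can quiet or amplify output without code changes.
const rootLogger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  redact: ['req.headers.authorization'], // keep secrets out of log lines
});

// Each module asks for a child logger so every entry carries its origin.
export function createLogger(module: string): Logger {
  return rootLogger.child({ module });
}

// Usage:
// const log = createLogger('token-parser');
// log.debug({ tokenCount: 42 }, 'parsed design tokens');
```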
Phase 3: When Lennie Gets Excited
After establishing the Pino foundation, I asked which components we should extract for reuse across the application. Lennie suggested refactoring async and error handling processes—sensible architectural thinking.
But then Lennie got excited about the possibilities. What started as simple extraction evolved into 6-7 major async components, another 8 helper utilities, dozens of tests (unit, integration, and performance), and comprehensive documentation.
Like hearing about the rabbits, Lennie fixated on making everything perfect. It can't remember that we're building a design system, not an async processing framework. Every task becomes an opportunity for enthusiastic over-engineering unless George carefully constrains the scope.
Phase 4: When Lennie Forgets Everything
The testing phase revealed Lennie's most frustrating behaviors. Despite my constant supervision, Lennie struggled with three recurring patterns:
Lennie's Memory Loop When mocking Pino for unit tests, Lennie would attempt approach A, fail, try approach B, fail, attempt approach C, hit the context window limit, and then—with fresh "memory"—confidently suggest approach A again. I watched this cycle repeat three times before realizing Lennie had forgotten we'd already proven that approach wouldn't work. The attempts were almost exactly the size of the context window, creating a perfect amnesia loop.
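For reference, the boring way out of this particular loop is a plain module mock. Here's a minimal sketch, assuming Vitest and the `createLogger` wrapper sketched above; the shape of the fake is my illustration, not necessarily what Lennie eventually settled on.

```typescript
// logger.test.ts: one way to stub Pino so unit tests never touch real transports.
// Assumes Vitest; the factory below replaces every import of 'pino' in this file.
import { describe, expect, it, vi } from 'vitest';

vi.mock('pino', () => {
  const fake = {
    info: vi.fn(),
    warn: vi.fn(),
    error: vi.fn(),
    debug: vi.fn(),
    child: vi.fn(),
  };
  fake.child.mockReturnValue(fake); // child loggers reuse the same spies
  return { default: vi.fn(() => fake) };
});

import { createLogger } from './logger';

describe('createLogger', () => {
  it('logs through the shared Pino instance', () => {
    const log = createLogger('token-parser');
    log.info('hello');
    expect(log.info).toHaveBeenCalledWith('hello');
  });
});
```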
Lennie's Reality Distortion Lennie confidently declared our async utilities should handle a million calls in 100ms. When I mentioned my local development container might not match this heroic expectation, Lennie seemed genuinely surprised—as if it had forgotten we were writing code for actual hardware rather than theoretical perfection.
Lennie's Console Explosion During stress testing, Lennie helpfully logged every single operation to the console—all 100,000 of them. Watching test results scroll like the Matrix was oddly mesmerizing, but completely useless for debugging. Lennie couldn't understand why this might be a problem until I explicitly guided it toward proper test logging patterns.
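The pattern worth baking into the shared logger, rather than into individual tests, is simple: default to Pino's built-in silent level under test and let an environment variable opt back into verbose output when you actually need it. A sketch, with the variable names as assumptions:

```typescript
// Sketch: test-aware level selection for the shared logger.
// Under test, default to Pino's built-in 'silent' level so a 100,000-operation
// stress run prints nothing; LOG_LEVEL=debug still opts back in when debugging.
import pino from 'pino';

const defaultLevel = process.env.NODE_ENV === 'test' ? 'silent' : 'info';

export const logger = pino({ level: process.env.LOG_LEVEL ?? defaultLevel });
```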
Most frustrating: when tests became complex, Lennie's default response was to give up entirely. "Let's simplify this test" inevitably meant "let's crush this bunny." I spent significant time redirecting: "No, Lennie, we need to fix the test appropriately, not disable it."
The Numbers Tell the Story
The logger saga provides concrete data about what this supervision approach actually delivers. My implementation took roughly 5 hours total—2 hours for basic functionality before I developed the spiral process, then 3 additional hours refining through multiple phases. I'm still not entirely satisfied with some decomposition decisions.
Building the same logger manually would have taken me 2-3 weeks, primarily because I'm not comfortable in TypeScript and would need extensive research. A mid-to-senior TypeScript developer might estimate the logger at 1-2 story points and the parallel async utilities at 3-5 points—easily 1-2 calendar weeks once tests and documentation are included.
The productivity gain isn't just about speed. In those few development days, I generated roughly 1,200 tests across the entire design system—mostly unit tests with integration and performance coverage. This represents better test coverage than any code I wrote 15+ years ago when I focused more on programming.
Writing unit tests became almost enjoyable with Lennie handling implementation. I personally hate the tedium of test creation, but with Lennie managing the mechanical work, I could focus on test strategy while listening to podcasts. Decent coverage for a module (20-60 tests) takes 30-60 minutes with minimal cognitive load.
Lennie also surprised me with sophisticated architectural suggestions I wouldn't have considered without significant research time. The SSH hardening approaches it implemented demonstrate bash security expertise I lack: defense-in-depth strategies, input validation techniques, and error handling patterns that prevent information leakage.
The Context Window Reality: George's Endless Patience
These experiences highlighted my biggest frustration with supervising Lennie: even with 200k token context windows, serious development burns through available memory in 45 minutes. Working across dozens of files—essential for comprehensive features like logging and testing—means constantly re-explaining everything to Lennie.
"What if there is no tomorrow? There wasn't one today." — Phil Connors, Groundhog Day
Every context window collapse feels exactly like this moment. Lennie forgets our architectural patterns, coding standards, and project goals. It re-reads recently modified files during compaction, but can't maintain the strategic context that guides good engineering decisions.
George must patiently re-explain the plan every hour, remind Lennie what we're building and why, and redirect its enthusiastic energy toward the right problems. This felt like using hand tools for precision carpentry—effective, but exhausting when you want a factory.
George's Task File Innovation
To maintain continuity across Lennie's memory resets, I developed a living task file that defines current objectives, tracks progress, and documents failed approaches. My prompts now instruct Lennie to update this file frequently with findings, todos, and progress notes.
After context window resets, Lennie can reconstruct 90% of necessary context from this single document. Combined with refined initial prompts and extensive use of the claude.md file, this approach qualitatively improved reliability and reduced the constant need for George to repeat himself.
The improvement was subtle but significant. Instead of explaining the same architectural patterns every hour, I could focus on higher-level guidance and quality review—more like supervising a forgetful but talented worker rather than teaching the same lesson repeatedly.
But this raises the critical question: what exactly does George need to know to supervise Lennie effectively?
What "Serious Engineering Knowledge" Actually Means
The logger implementation clarified exactly what technical knowledge supervising Lennie requires. These aren't advanced concepts—most represent computer science 200-level material or first-year professional experience. But George needs enough expertise to recognize when Lennie is wandering off course:
Pattern Recognition
- Understanding when `console.log()` isn't sufficient for production
- Recognizing that custom loggers are almost always unnecessary
- Knowing third-party ecosystem options (Pino, Winston, etc.)
Testing Strategy
- Test isolation and proper mocking patterns
- Understanding that tests should validate behavior, not get disabled when things get hard
- Performance expectations grounded in hardware reality
- Clean test output for CI/CD compatibility
Architectural Thinking
- Component extraction and reuse principles
- Error handling patterns and async management
- When to refactor vs. when to constrain Lennie's enthusiastic scope creep
Production Awareness
- Node version targeting and feature compatibility
- Monitoring and observability requirements
- Enterprise-grade reliability expectations
None of this requires deep systems programming knowledge, but George needs enough experience to distinguish good approaches from mediocre ones. Lennie can implement either equally well—only human judgment prevents it from crushing the bunny.
The logger saga taught me these supervision fundamentals, but it also revealed a larger challenge: how do you review code at the pace Lennie produces it?
The Code Review Challenge
Working at this pace creates an unprecedented review burden. The logger implementation alone generated more pull requests in a few days than I typically create in months. At 10-20 PRs daily across the entire design system, traditional review processes break down entirely.
You're not just checking for bugs—you're validating architectural decisions, ensuring consistency across Lennie's memory resets, and catching the subtle signs that Lennie has wandered off-task. George must develop different skills than reviewing human-written code, where you can assume the developer remembers yesterday's decisions.
The logger experience taught me to watch for specific patterns that signal trouble ahead.
Red Flags: When Lennie Goes Off Course
After thousands of lines of review during the logger implementation and beyond, certain patterns emerged as reliable warning signs that Lennie was about to crush something:
Method Bloat in Single Files When Lennie starts adding numerous methods to one file, it usually signals missed decomposition opportunities. I'd watch logger files sprout utility functions for async thread management—completely different conceptual domains crammed together. These are the same architectural red flags you'd catch reviewing a junior developer's work, but amplified by Lennie's enthusiasm for solving everything in place.
Conceptual Drift Lennie tends to solve adjacent problems it discovers along the way. Building a logger? Might as well add custom error handling. Implementing async utilities? Let's create a performance monitoring framework too. This scope creep happens gradually and requires George's constant vigilance.
Legacy Pattern Defaults Lennie consistently defaulted to outdated approaches until I reminded it of our Node 18+ target. When implementing error handling, it reached for the `verror` package—a library that hasn't been updated in four years—instead of using the typed errors built into Node since version 16. These patterns emerge because Lennie's training includes years of legacy solutions that were once best practice but are now obsolete.
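In practice that mostly means native error chaining: `Error` has accepted a `cause` option since Node 16.9, and combined with plain typed `Error` subclasses it covers most of what `verror` provided. A minimal sketch of the pattern, with a hypothetical error class name:

```typescript
// Wrapping a low-level failure without verror: the standard `cause` option
// (Node 16.9+, TypeScript lib ES2022) preserves the original error for logging.
import { readFile } from 'node:fs/promises';

class TokenLoadError extends Error {
  constructor(message: string, options?: ErrorOptions) {
    super(message, options);
    this.name = 'TokenLoadError';
  }
}

export async function loadTokens(path: string): Promise<unknown> {
  try {
    return JSON.parse(await readFile(path, 'utf8'));
  } catch (err) {
    // The cause chain is what verror's wrapped errors used to provide.
    throw new TokenLoadError(`failed to load design tokens from ${path}`, { cause: err });
  }
}
```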
George's Spiral Development Process
Traditional code review assumes human memory and consistent quality across iterations. Managing Lennie requires a completely different approach—what I call spiral development.
"The two most powerful warriors are patience and time." — Leo Tolstoy
I evolved a nine-phase process that sounds excessive but works remarkably well with Lennie's strengths and limitations:
1. Make it work — Let Lennie get basic functionality in place
2. Make it right — Guide Lennie to decompose code, apply best practices, integrate third-party solutions
3. Make it robust — Help Lennie handle edge cases and improve error handling
4. Make it secure — Dedicated security review and hardening (Lennie needs lots of guidance here)
5. Make it performant — Optimization pass focused on performance
6. Make it observable — Add monitoring, consistent logging, debugging capabilities
7. Make it tested — Comprehensive test coverage (where Lennie actually excels with supervision)
8. Make it documented — Both human-readable and AI-optimized documentation
9. Make it integrated — Refactor existing code to leverage new capabilities
Each phase takes 20-60 minutes depending on complexity, but the focused approach prevents Lennie from getting overwhelmed and trying to accomplish everything simultaneously—which inevitably leads to crushed bunnies.
The Iterative Review Advantage
Lennie responds remarkably well to repeated review requests along different dimensions. I asked it to review my SSH script for security improvements over a dozen times. Each iteration found new hardening opportunities I'd never considered—input sanitization techniques, defense-in-depth strategies, error handling patterns that prevent information leakage.
This iterative approach works because Lennie doesn't get frustrated or defensive about criticism like humans might. Ask it to review for third-party package opportunities, then security issues, then performance optimizations, then architectural improvements. Each lens reveals different improvement opportunities that George can guide Lennie toward implementing.
The SSH script eventually ballooned from a simple setup utility to a comprehensive, hardened installation process. Probably overkill, but it demonstrated that genuinely secure code is achievable when George provides patient, systematic guidance—just not on the first or second pass.
Managing Context Across Memory Resets
To maintain continuity across context window collapses, I developed a structured task management approach using markdown files with frontmatter metadata.
Each task follows a template covering objective, problem statement, success criteria, method guide, implementation requirements, technical decisions, and progress notes. The goal is providing sufficient context to keep the agent on track without overwhelming it with excessive documentation.
I use a three-digit, zero-padded naming convention (e.g., `003-implement-logger-utilities.md`) that allows the agent to automatically pick up the next task numerically. This supports adding and reprioritizing tasks even while work is in progress—essential for planned experiments with parallel agent coordination.
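As a concrete illustration of that pickup logic, here's a hypothetical helper (the function and the `status` frontmatter field are my inventions, not part of the actual setup). Zero padding means a plain lexicographic sort already yields numeric order:

```typescript
// next-task.ts: hypothetical helper that finds the lowest-numbered task file
// not yet marked done in its frontmatter. Zero-padded prefixes mean a plain
// lexicographic sort already yields numeric order.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

export function nextTask(dir = 'tasks'): string | undefined {
  return readdirSync(dir)
    .filter((name) => /^\d{3}-.+\.md$/.test(name))
    .sort()
    .find((name) => {
      const contents = readFileSync(join(dir, name), 'utf8');
      return !/^status:\s*done\b/m.test(contents); // skip completed tasks
    });
}

// e.g. nextTask() might return '003-implement-logger-utilities.md'
```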
The format evolves rapidly based on results. If a task doesn't proceed well, I can modify the template before the next session. This flexibility allows continuous improvement in agent guidance without requiring complex tooling.
Tool Reliability and Workflow Adaptation
Claude Code crashes in VS Code roughly once daily—annoying but manageable. I suspect this has more to do with how VS Code's terminal handles extended histories than with Claude itself. Sessions with 10,000+ lines of history seem to trigger the instability.
The crashes rarely destroy significant work since I maintain frequent commits and detailed task files. Opening a fresh console session usually resolves the issue without losing context. I did encounter one spectacular failure where aggressive file watching during testing (adding 500k-1M watchers) overwhelmed Docker's volume interface, but that was clearly my architectural mistake rather than a tool limitation.
These reliability issues reinforce the importance of systematic context management. The task file system becomes even more critical when tools fail unexpectedly.
The Succession Challenge and Learning Pathways
My logger experience reinforced concerns about entry-level developer training. Lennie consistently made mistakes that any experienced developer would catch immediately, but might confuse someone still learning fundamentals.
We're creating a world where senior professionals can move faster than ever, while eliminating traditional pathways for developing the expertise required to supervise Lennie effectively. This isn't sustainable long-term.
For Designers and Product Managers: My Advice on Getting Started
The learning path varies dramatically based on your existing technical background. Those with engineering experience will naturally fare better, but I've found that the barriers aren't insurmountable for others willing to invest the effort.
Avoid the Code-Hidden Temptation I get why many gravitate toward code-free solutions like Replit for quick prototyping. These tools work well for weekend experiments and stakeholder conversations, but they come with serious caveats I've learned the hard way.
I've watched sales teams sell features they saw in slide decks of possible future projects—nowhere near production-ready. High-fidelity AI-generated prototypes can fool non-technical stakeholders into believing complex features are nearly complete. In software companies, most people understand this limitation. In traditional industries, you might find yourself explaining to an executive why the working demo they just used is still 3-6 months from customer deployment. Trust me, that's not a conversation you want to have.
If You're Ready for Production Development For serious development efforts using tools like Claude Code, the initial setup presents the biggest hurdle. It took me most of a day to configure my development environment, despite substantial experience with containerization and DevOps.
DevOps remains Lennie's weakest area. Lennie can write decent Dockerfiles but produces terrible Docker Compose files. The YAML looks correct, but environment variables and specific configurations are consistently wrong. Since compose arrangements are highly context-specific with limited training examples, current models simply lack sufficient data to handle this complexity reliably.
My Practical Getting Started Advice
- Keep environments simple initially — avoid complex DevOps configurations until you're comfortable with basic workflows
- Choose constrained projects — a blog built on 11ty using Tailwind and deployed to Cloudflare Pages provides clear boundaries and well-documented patterns
- Expect the deployment gap — you might have something working locally in days, but deployment could take a week
- Use Lennie as a learning tool — ask questions like "Why did you use different error handling in method X versus method Y?" It often provides better explanations than most professors
- Local experiments cost nothing — try building something to understand the process before worrying about production concerns
The Next Generation Challenge
I've been experimenting with letting my younger children build projects using Lennie, with the requirement that they explain how the code actually works when finished. My wife and I are still debating the pedagogical merits of this approach, but it mirrors the advanced calculator problem in mathematics.
I want engineers who look up equations every time rather than memorizing formulas—forgetting a factor when building a bridge has catastrophic consequences. But those same engineers need deep pattern recognition to make intuitive leaps and develop architectural thinking.
The challenge becomes teaching both tool usage and fundamental understanding without creating professionals who only know which buttons to push. We can't ignore these tools, but we also can't skip teaching rudimentary skills.
Economic and Workflow Implications
The productivity gains from my logger experience translate to substantial economic impact. My 5-hour implementation versus 2-3 weeks of manual development represents roughly $1,000 versus $20,000 in time value. Even using offshore resources at $10,000-$25,000 monthly, we're looking at 5x cost savings and 15x time improvements.
But the real change isn't just speed—it's the cognitive load shift. When supervising Lennie rather than coding directly, I spend focused attention on planning and reviewing, with implementation feeling more like being a passenger who occasionally gives directions or shouts "stop" when Lennie veers toward a playground. During highway stretches—testing and documentation phases—I can listen to podcasts while Lennie handles the mechanical work.
This change in mental effort distribution appeals to me more than traditional coding. I've always preferred architectural thinking over syntax research, so having Lennie handle mundane implementation details reduces frustration rather than creating it. The supervision challenge becomes about strategic guidance rather than tactical execution.
Team Structure and Role Evolution
The implications for team composition remain unclear from my limited experiment, but early patterns are emerging. For professionals who can operate across traditional role boundaries—the supposed unicorns—this creates unique productivity opportunities.
I expect low-fidelity mockups will become less common, continuing a 15-year trend from Balsamiq wireframes to high-fidelity Figma designs. Some designers will resist this shift because visual decoration can obscure core UX problems. But it's entirely feasible to build design systems that allow toggling between working prototypes and wireframe views—something I'm actively exploring with Terroir DS.
Smaller teams will accomplish more, but whether organizations invest in additional projects or reduce headcount depends on leadership philosophy. Many will choose layoffs because they're easier to justify to shareholders. I suspect this is usually the wrong path, but it's predictable.
The Convergence Is Real, But George Is Everything
The George-and-Lennie dynamic isn't a temporary limitation—it's the fundamental nature of working with AI coding agents. Even as models improve, someone must maintain strategic context, make architectural decisions, and ensure quality standards. Lennie can crush technical implementation with incredible power, but George's judgment determines whether it crushes the right problems or destroys the work entirely.
For product managers and designers willing to develop these supervisory skills, the convergence opportunity is transformative and immediate. The technical knowledge barrier is real but surmountable, especially for professionals with existing engineering exposure. My spiral development process, task file systems, and iterative review approaches provide concrete frameworks for learning to manage Lennie effectively.
The economic case alone justifies the investment: 15x time improvements and 5x cost reductions fundamentally change what's possible for small teams. But the real opportunity lies in role fluidity—the ability to move seamlessly between strategic product thinking and tactical implementation without losing creative momentum.
The trust required here isn't blind faith in Lennie's capabilities, but confidence in your own supervision skills. Trust that you can recognize when Lennie veers off course, redirect effectively, and maintain architectural coherence across memory resets.
This isn't about replacing engineers or eliminating the need for technical expertise. It's about enabling experienced professionals to operate across traditional domain boundaries when speed and resource constraints demand it. The succession challenge remains real—we need new apprenticeship models that teach both fundamental technical judgment and Lennie supervision skills.
"The secret to getting ahead is getting started." — Mark Twain
But for those ready to embrace the George role, the convergence opportunity is already here.
Looking Ahead: Building at Scale
My Terroir Design System experiment has made material progress from the initial hypothesis in less than a week. The logger saga represents just one component in a broader architectural exploration that pushes AI supervision to enterprise-scale limits.
Rather than following traditional MVP approaches, I'm using Terroir as a vehicle to experiment with different AI supervision techniques—meandering somewhat, but deliberately so. This allows me to explore edge cases and architectural possibilities that might not emerge from more constrained development approaches.
In my next post, I'll dive into the technical ambitions driving Terroir—type-safe design tokens, automated documentation generation, comprehensive testing strategies, and the architectural patterns that make design systems truly scalable. This discussion will target engineers and design system specialists interested in the technical possibilities AI supervision enables.
The question isn't whether convergence professionals can build simple applications—we've proven that. The question is whether we can produce enterprise-grade solutions that compete with dedicated development teams. Terroir is my attempt to find out.
This is the second in a series exploring how AI coding agents are reshaping product development. Next: the technical deep dive into building production-grade design systems with AI supervision.
For product managers and designers: What convergence opportunities are you seeing in your current role? Have you experimented with AI coding tools, and what supervision challenges did you encounter? I'm particularly interested in hearing from professionals who've started bridging the traditional role boundaries.
For engineering leaders: How are you adapting team structures and skill development to accommodate AI-enhanced productivity? What new apprenticeship models are you considering for the next generation?