Hiya!
I'm back! I feel that I owe you an explanation of what's going on here.
I started this blog because I want to share my insights on agentic coding from the perspective of a developer, CTO, CEO and founder. I plan to cover the entire autonomous AI coding journey.
Throughout this series, we'll get our mindset right about the many roles you'll take on: product designer, project manager, tech lead and quality assurance engineer. Later, I'll take you through a brainstorming session. Once we have a feature specification in place, we will learn how to manage a group of coding agents. We'll learn how to enforce the rules and, most importantly, why they're important. By the end, you will be confidently shipping AI-generated code to production. We will be doing some 'vibe coding' in production.
Not necessarily in that order.
Buckle up. Here's post #2.
In the previous post we covered how to make Claude stick to conventions (tl;dr - skills + hooks fix it). Now it follows the rules, but...
Marcin, all tasks are complete.
I open a browser and see:
`NoMethodError: undefined method 'hallucinated_method' for an instance of User (NoMethodError)`
Yeah, good job, Claude! High five, let's ship it to production... NOT.
This brings us to a fundamental question: how do you know the software is working?
Back in 2017, I was working on a payment processor for a company called Paladin Software. We were processing huge YouTube earnings spreadsheets (yes, gigabyte-sized CSV files). My job was to ensure that we did it on time and that it simply worked.
One beautiful Thursday afternoon, I headed to my favourite spot in Krakow at the time, Dolnych Młynów. Friday was a day off.
As you might have guessed, one of the clients uploaded their spreadsheet on Friday. When I got back to the office on Monday, the earnings still hadn't been processed. Questions were being asked.
I'm looking at Sidekiq's failed jobs queue:
`NoMethodError: undefined method 'hallucinated_method' for an instance of User (NoMethodError)`
Was I an LLM before it was a thing? I was certainly shipping code like one.
LLMs have something like anterograde amnesia: the inability to form new memories after the onset of the condition. They remember their past, but new experiences don't stick. Unlike me, they can't learn from a production incident on a Friday. Every session starts from zero.
This is why they must be given a set of strict rules each time they write code (see the previous post about enforcing these rules). However, rules alone are not enough. We also need checks and reviews.
Deterministic checks
LLMs are non-deterministic. This means that, just like me, the AI agent will sometimes produce excellent code and sometimes it won't. Sometimes it will spend a lot of time testing the feature; at other times, a single test will seem like more than enough.
To mitigate this, we need to implement some deterministic checks in our workflow. We need something that clearly indicates when something is wrong. Here's my opinion on what should be included in local CI:
- Rubocop - static code analyser and linter. Autocorrect on.
- Prettier - ERB, CSS and JS code formatter.
- Brakeman - a static analysis tool that checks for security vulnerabilities.
- RSpec - testing framework, use your favourite. Important! Use SimpleCov to report coverage.
- Undercover - warns about methods, classes and blocks that were changed without tests. It does so by analysing data from git diffs, code structure and SimpleCov coverage reports (see the coverage setup sketch right after this list).
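Undercover needs an LCOV report to compare against the git diff. Here is a minimal sketch of a coverage setup that produces one, assuming the simplecov-lcov gem (the paths and filters are my assumptions, adjust to your project):

```ruby
# spec/spec_helper.rb -- illustrative coverage setup for Undercover
# SimpleCov must be required before any application code is loaded.
require "simplecov"
require "simplecov-lcov"

# Undercover reads a single LCOV file, so tell the formatter to produce one.
SimpleCov::Formatter::LcovFormatter.config do |config|
  config.report_with_single_file = true
  config.single_report_path = "coverage/lcov/app.lcov"
end

SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter.new([
  SimpleCov::Formatter::HTMLFormatter, # human-readable report
  SimpleCov::Formatter::LcovFormatter  # machine-readable report for Undercover
])

SimpleCov.start "rails" do
  add_filter "/spec/"
end
```

With that report in place, `bundle exec undercover --lcov coverage/lcov/app.lcov --compare origin/master` can flag every changed method, class or block that no spec exercises.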
We enforce a single code style. Our code is secure. We have tests for new and changed code. All tests pass (by the way, how many times has an AI agent told you that a test failure is unrelated to their changes?).
There are no more runtime errors. Reliability is better. Once again, some of my frustrations are gone.
You might ask: aren't there too many tests and too much boilerplate? No, unit tests are fast. With coding agents, they are more maintainable than ever before. This is a pretty good deal for improved reliability.
Local CI
Wrap all of this in your local CI. If you're running Rails 8.1 or later, it's already in the framework. For Rails 8.0 and earlier, you can use my ported implementation. Alternatively, you can create your own.
This is the sample output:
Continuous Integration
Running checks...
Rubocop
bundle exec rubocop -A
✅ Rubocop passed in 1.98s
Prettier
yarn prettier --config .prettierrc.json app/packs app/components --write
✅ Prettier passed in 1.57s
Brakeman
bundle exec brakeman --quiet --no-pager --except=EOLRails
✅ Brakeman passed in 8.76s
RSpec
bundle exec parallel_rspec --serialize-stdout --combine-stderr
✅ RSpec passed in 1m32.45s
Undercover
bundle exec undercover --lcov coverage/lcov/app.lcov --compare origin/master
✅ Undercover passed in 0.94s
✅ Continuous Integration passed in 1m45.70s
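If you decide to create your own, a minimal bin/ci can be little more than a Ruby script that runs each check in order and stops at the first failure. Here is a sketch (not the ported implementation linked above; the commands are the same ones shown in the output):

```ruby
#!/usr/bin/env ruby
# bin/ci -- minimal local CI runner (illustrative sketch)

CHECKS = {
  "Rubocop"    => "bundle exec rubocop -A",
  "Prettier"   => "yarn prettier --config .prettierrc.json app/packs app/components --write",
  "Brakeman"   => "bundle exec brakeman --quiet --no-pager --except=EOLRails",
  "RSpec"      => "bundle exec parallel_rspec --serialize-stdout --combine-stderr",
  "Undercover" => "bundle exec undercover --lcov coverage/lcov/app.lcov --compare origin/master"
}.freeze

suite_started_at = Time.now

CHECKS.each do |name, command|
  puts "\n#{name}\n#{command}"
  check_started_at = Time.now

  unless system(command)
    puts "❌ #{name} failed after #{(Time.now - check_started_at).round(2)}s"
    exit 1 # fail fast so the agent gets a clear, deterministic signal
  end

  puts "✅ #{name} passed in #{(Time.now - check_started_at).round(2)}s"
end

puts "\n✅ Continuous Integration passed in #{(Time.now - suite_started_at).round(2)}s"
```

Make it executable with `chmod +x bin/ci` and both you and the agent have a single command that answers 'is anything broken?' deterministically.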
Code review
Back in 2014, I got my second IT job as a junior Rails developer at Netguru. The onboarding process included the Netguru way of writing code. Specific libraries and patterns.
I was writing code The Rails Way. I didn't have much experience with production-grade Rails apps. During one of the code reviews, I received feedback that my models were a bit too fat, along with a link to the Code Climate article '7 Patterns to Refactor Fat ActiveRecord Models'.
I had kinda heard of these rules. I was just so focused on getting the business logic right that I didn't apply them...
Oh wait, isn't that the exact same thing Claude told me?
"The CLAUDE.md says 'ALWAYS STOP and ask for clarification rather than making assumptions' and I violated that repeatedly. I got caught up in the momentum of the Rails 8 upgrade and stopped being careful."
It's not a new problem for the software industry. The remedy? Code review, obviously. Each pull request must be checked by another developer. This allows less experienced developers to learn good practices and enables more experienced developers to mentor others and pass on their knowledge. The rules are also enforced. Everybody wins.
Remember: never let the developer review their own code. The same applies to AI agents.
Three-stage code review
Why three stages?
Firstly, we check that the implementation complies with the feature specification. This involves verifying that the agent built what was requested (neither more nor less).
Secondly, a review of Rails and project-specific conventions. To do this, we load all the conventions (see the previous post) and check the changes against them. Are the interfaces clean? ViewComponents instead of partials? Are jobs idempotent and thin? Do the tests verify behaviour?
Last but not least, a general code quality review of architecture, design, documentation, standards and maintainability.
All of these things give us a comprehensive overview of the implementation and any possible deviations. Each review is carried out by a different agent with a fresh perspective and no attachment to the feature.
Here's what a full report looks like in practice:
1. Spec compliance - line-by-line verification:
| Requirement | Implementation | Status |
|---|---|---|
| Column: delay_peer_reviews | :delay_peer_reviews | ✅ Match |
2. Rails conventions - checklist:
| Convention | Status |
|------------|--------|
| Reversible migration | PASS |
| Handles existing data | PASS |
3. Code quality - structured report with Strengths, Critical/Important/Minor issues, references, and merge assessment.
Final summary table:
| Check | Status |
|-------|--------|
| ✅ Spec compliance | Passed |
| ✅ Rails conventions | Passed |
| ✅ Code quality | Approved with minor suggestions |
| ✅ Local CI | Passed |
Ready for merge.
When issues are found, it consolidates them:
## Legitimate Findings to Address
1. No error handling in Discord::Client#post
2. No error handling in OAuth callback
## Findings I'm Skipping (Your Explicit Decisions)
- No encrypts on token fields (you requested this)
Which of these do you want me to address?
My /codereview command and review agent prompts are on GitHub.
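For context, a custom slash command in Claude Code is just a Markdown prompt file under .claude/commands/. A heavily stripped-down /codereview might look something like this (an illustrative sketch only, not the actual prompts from that repo):

```markdown
<!-- .claude/commands/codereview.md (illustrative sketch) -->
Run a three-stage review of the current branch against origin/master,
using a separate agent with a fresh context for each stage:

1. Spec compliance: compare the diff with the feature specification line by
   line; flag anything missing and anything built that was never requested.
2. Conventions: load the project conventions and check the changes against
   them (clean interfaces, ViewComponents over partials, thin idempotent
   jobs, tests that verify behaviour).
3. Code quality: architecture, design, documentation and maintainability.

Consolidate the findings into one report with a final summary table,
separate legitimate findings from my explicit decisions, and ask me which
findings to address before changing any code.
```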
How does this fit together
1. Let the AI agent write the code.
2. Tell the agent to run bin/ci and fix everything until it's green. Every failure is their responsibility (see the CLAUDE.md sketch after this list).
3. They will make the local CI green.
4. Run the /codereview command.
5. The agent fixes any issues.
6. Run /codereview again until the code is ready.
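One way to make step 2 stick without repeating yourself every session is to spell it out in CLAUDE.md. A hypothetical excerpt (my wording, not taken from any particular repo):

```markdown
<!-- CLAUDE.md (hypothetical excerpt) -->
## Definition of done
- Run `bin/ci` after every set of changes and fix every failure, even the
  ones that look unrelated to your changes.
- A task is complete only when `bin/ci` is green and the `/codereview`
  findings have been addressed.
```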
Personally, I don't read the code until this point. As a good manager, I don't micromanage.
Be a good manager, too. Provide a set of rules and the tools needed to enforce them. Don't micromanage. If you're not happy with the results, adjust the rules. Repeat until you are happy with the results.
Trust, but verify.
The spec compliance and code quality review agents are based on https://github.com/obra/superpowers.
This is the second post in a longer series.
So far, we have covered:
- Post #1: "Claude ignores my rules" → skills and hooks.
- Post #2 (this post): "Claude follows the rules now but the code still breaks" → code review and CI.
I'd love to hear your thoughts. Reach out to me on LinkedIn or at marcin@fryga.io.
