As we kick off the new year, many companies are full speed ahead on leveraging AI tooling for productivity. In fact, some companies are looking at token usage as a way to assess AI adoption and productivity. As I began my own exploration of the Claude Code CLI, I couldn't help but wonder: are tokens the new lines of code, easy to count but impossible to trust?
After taking Anthropic's "Claude Code in Action" course, I developed an experiment to test my new knowledge. My hypothesis was simple: Claude Code features follow a diminishing returns curve. At some point, more tokens don't return better code.
My Experiment: Build a good ol' toy app, CLI Tic Tac Toe, four times, each time using a different Claude Code technique. For each build, I logged token usage, created a QA agent to find bugs, and wrote a prompt to assess code quality. Then I combined those numbers into a single metric, quality per token (QPT): the quality score divided by tokens used, scaled per 1,000 tokens. Comparing QPT across techniques let me test my hypothesis.
The Four Techniques for building Tic Tac Toe:
- Zero-shot: Raw prompt, no context
- Plan Mode: Explicit planning step before execution
- CLAUDE.md: Project context file, no planning
- CLAUDE.md + Plan Mode: Combined approach
I used claude --verbose to track token usage.

Note: I used a fresh Claude session for each experiment.
Experiment 1: zero-shot prompt
> Can you create a CLI tic tac toe game using vanilla javascript with minimal dependencies. I would like it to be tested using jest and I would like it to be good quality code using classes.
Experiment 2: Plan Mode
> --plan <Same Prompt from Experiment #1>
Experiment 3: CLAUDE.md context file
For this build, I created a CLAUDE.md context file in the project directory.
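A minimal sketch of that kind of context file (assuming it captures the same requirements as the zero-shot prompt, not my exact wording) might look like this:

```markdown
# CLI Tic Tac Toe

## Project
A command-line tic tac toe game written in vanilla JavaScript with minimal dependencies.

## Conventions
- Use ES6 classes (e.g., a Board class and a Game class)
- No frameworks; standard library only
- Keep the code readable and maintainable

## Testing
- Use Jest for unit tests
- Cover win detection, draws, and invalid moves
```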
Then I ran it:
> create the game
Experiment 4: Same CLAUDE.md file from Experiment 3 + plan mode
> --plan create the game
Assessing Quality: Role-based prompt
> You are a senior engineer, can you assess this code for quality and give it a score code 1-5 on correctness, clarity, structure, maintainability, and extendability? Can you return an average of the scores?
Manual QA Agent
I created a test.js file and gave an agent the context to play 10 tic tac toe games against each build and report any bugs it found. The agent reported zero bugs for all four builds.
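For reference, here is a minimal sketch of the kind of Jest harness that check amounts to (the TicTacToe class and its makeMove, getWinner, and isDraw methods are hypothetical stand-ins for whatever API each generated build actually exposed):

```javascript
// test.js — plays random games end-to-end and flags rule violations.
// TicTacToe, makeMove(), getWinner(), and isDraw() are assumed names,
// not the exact API produced by every build.
const TicTacToe = require('./ticTacToe');

function playRandomGame() {
  const game = new TicTacToe();
  const open = [0, 1, 2, 3, 4, 5, 6, 7, 8];
  // Keep making random legal moves until the game ends or the board is full.
  while (open.length && !game.getWinner() && !game.isDraw()) {
    const cell = open.splice(Math.floor(Math.random() * open.length), 1)[0];
    game.makeMove(cell); // assumes the build alternates X and O internally
  }
  return game;
}

test('10 random games end in a win or a draw without crashing', () => {
  for (let i = 0; i < 10; i++) {
    const game = playRandomGame();
    // Assumes getWinner() returns null when nobody has won.
    expect(game.getWinner() !== null || game.isDraw()).toBe(true);
  }
});
```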
Notes:
- This experiment was deliberately small and built around a well-known problem so the results would be quick to understand.
- This experiment used a lot of AI, but that was by design. I wanted to understand the tooling and token use, not how to build TTT... I've done it many times 😉.
Results
| Approach | Tokens | Quality (1-5) | QPT (quality per 1k tokens) |
|---|---|---|---|
| CLAUDE.md | 25,767 | 4.9 | 0.190 |
| CLAUDE.md + --plan | 32,191 | 4.6 | 0.143 |
| Zero-shot | 42,737 | 4.8 | 0.112 |
| --plan | 52,910 | 4.8 | 0.091 |
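The QPT column falls straight out of the raw numbers. A quick sketch of the calculation (quality score divided by thousands of tokens used):

```javascript
// QPT = quality ÷ (tokens / 1000), i.e., quality per 1,000 tokens.
const runs = [
  { approach: 'CLAUDE.md',          tokens: 25767, quality: 4.9 },
  { approach: 'CLAUDE.md + --plan', tokens: 32191, quality: 4.6 },
  { approach: 'Zero-shot',          tokens: 42737, quality: 4.8 },
  { approach: '--plan',             tokens: 52910, quality: 4.8 },
];

for (const { approach, tokens, quality } of runs) {
  const qpt = quality / (tokens / 1000);
  console.log(`${approach}: QPT = ${qpt.toFixed(3)}`); // 0.190, 0.143, 0.112, 0.091
}
```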
Conclusion
My hypothesis was partially right: as tokens increased, quality stayed flat or dropped. What I didn't anticipate was that the lowest token investment would also return the highest quality.
My colleague Rani recently wrote:
"Context is Currency... In the world of AI, if you don't give the model the right background and constraints, it will confidently give you the wrong answer."
In this case, less upfront context meant spending more tokens for the same or lower quality.
I suspect the next wave after AI adoption will be understanding optimization and return on AI investment.
Some of the earliest company-wide AI adopters are already weighing in on this.

For AI models, context matters more than token use. I'm interested in digging into context engineering next; it seems that well-structured context yields better results than simply feeding more tokens to the model. It really does seem like an art.
🖼️ Cover Art: "Urania" depicted by Giacinto Gimignani, 1852
