<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serhii Kravchenko</title>
    <description>The latest articles on DEV Community by Serhii Kravchenko (@awrshift).</description>
    <link>https://dev.to/awrshift</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3844993%2F73ae5446-5788-491f-ae51-f537a61eeda7.jpeg</url>
      <title>DEV Community: Serhii Kravchenko</title>
      <link>https://dev.to/awrshift</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/awrshift"/>
    <language>en</language>
    <item>
      <title>I Over-Engineered Karpathy's Agent Memory. Here's What Actually Works.</title>
      <dc:creator>Serhii Kravchenko</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:55:11 +0000</pubDate>
      <link>https://dev.to/awrshift/i-over-engineered-karpathys-agent-memory-heres-what-actually-works-4imk</link>
      <guid>https://dev.to/awrshift/i-over-engineered-karpathys-agent-memory-heres-what-actually-works-4imk</guid>
      <description>&lt;p&gt;Andrej Karpathy sketched out a beautiful idea for AI agent memory. Conversations flow into daily logs. Daily logs get compiled into a wiki. The wiki gets injected back into the next session. Your agent builds its own knowledge base over time.&lt;/p&gt;

&lt;p&gt;53K stars on GitHub. I read the code, loved the concept, and built the whole thing into my Claude Code setup.&lt;/p&gt;

&lt;p&gt;My version has 8 stars. Size doesn't matter... in this case.&lt;/p&gt;

&lt;p&gt;Because three weeks later, half my sessions were losing context and I had no idea why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What went wrong
&lt;/h2&gt;

&lt;p&gt;The original architecture assumes you have full control over your AI pipeline. Server-side execution. Background processes that always finish cleanly. Programmatic access to transcripts.&lt;/p&gt;

&lt;p&gt;Claude Code runs on your laptop. You close the lid, the background process dies. You hit a long session, the transcript parser chokes. You forget to check the logs, and the compile pipeline fails silently for weeks.&lt;/p&gt;

&lt;p&gt;Here's what my setup looked like after I copied the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# session-end.sh (the old version)&lt;/span&gt;
&lt;span class="c"&gt;# Extracts last 100 turns from transcript&lt;/span&gt;
&lt;span class="c"&gt;# Spawns flush.py in background&lt;/span&gt;
&lt;span class="c"&gt;# flush.py calls claude -p Opus to summarize&lt;/span&gt;
&lt;span class="c"&gt;# flush.py checks if it's after 6pm&lt;/span&gt;
&lt;span class="c"&gt;# If yes, spawns compile.py (also background)&lt;/span&gt;
&lt;span class="c"&gt;# compile.py calls claude -p Sonnet to build wiki articles&lt;/span&gt;
&lt;span class="c"&gt;# CLAUDE_INVOKED_BY guard prevents infinite recursion&lt;/span&gt;
&lt;span class="c"&gt;# SHA-256 hash comparison skips unchanged logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's a lot of moving parts for "remember what I did today."&lt;/p&gt;

&lt;p&gt;I tracked the results over three weeks across 7 active projects. The auto-flush hook fired on every session end. About half the time, the background process either timed out, failed to parse the transcript, or exited silently with no output. The daily log file was either empty or never created.&lt;/p&gt;

&lt;p&gt;The after-6pm auto-compile? It never triggered once in production. The hash check worked fine. The subprocess just never got there.&lt;/p&gt;

&lt;p&gt;I was running what looked like a sophisticated knowledge pipeline. In reality, every other session vanished into nothing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Three fixes that actually worked
&lt;/h2&gt;

&lt;p&gt;I didn't scrap the architecture. The two-layer memory concept is genuinely good. Hot cache for quick patterns, wiki for deep knowledge, daily logs as the raw source. That part stays.&lt;/p&gt;

&lt;p&gt;What I killed was the automation layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 1: One manual command instead of background automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I replaced the entire auto-flush pipeline with a single skill called &lt;code&gt;/close-day&lt;/code&gt;. You type it when you're done working. That's it.&lt;/p&gt;

&lt;p&gt;The difference is simple. When the background process ran, it had no project context. It was parsing raw transcript JSON, trying to figure out what mattered. When you call &lt;code&gt;/close-day&lt;/code&gt; inside your session, the agent still has the full picture. It knows what you worked on, what decisions you made, what's pending for tomorrow.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# session-end.sh (the new version)&lt;/span&gt;
&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_INVOKED_BY&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;fi
&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_PROJECT_DIR&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_DIR&lt;/span&gt;&lt;span class="s2"&gt;/.claude/state"&lt;/span&gt;
&lt;span class="nv"&gt;HOOK_INPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;SESSION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOOK_INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin).get('session_id','unknown'))"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"unknown"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s1"&gt;'+%Y-%m-%d %H:%M:%S'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; SessionEnd: &lt;/span&gt;&lt;span class="nv"&gt;$SESSION_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_DIR&lt;/span&gt;&lt;span class="s2"&gt;/.claude/state/flush.log"&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;30 lines. Logs a timestamp. Done. The actual knowledge capture happens when you deliberately ask for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 2: Date tags on everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every entry in MEMORY.md now carries a &lt;code&gt;[YYYY-MM-DD]&lt;/code&gt; tag. Every update to the project backlog, every note in the session handoff file. When &lt;code&gt;/close-day&lt;/code&gt; runs, the agent greps for today's date across all structured files and knows exactly what changed.&lt;/p&gt;

&lt;p&gt;This matters when you have long days. I sometimes run 10 to 15 sessions before calling it a day. The old system tried to parse each session transcript separately and merge them. The new system doesn't care about individual sessions. It looks at the end state of every file and asks: what has today's date on it?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Proven Patterns&lt;/span&gt;

| Pattern | Evidence |
|---------|----------|
| Flush deprecated, /close-day replaces [2026-04-17] | ~50% failure rate |
| README sells simplicity [2026-04-17] | Marketer ICP test |
| NSP self-cleaning rule [2026-04-17] | 455 lines trimmed to 161 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No transcript parsing. No background merge logic. Just dates and grep.&lt;/p&gt;
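
&lt;p&gt;The same scan is a few lines of stdlib Python. A sketch of the idea (the function name and file layout are illustrative, not the kit's actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime
import pathlib

def today_entries(root="."):
    """Collect every Markdown line tagged with today's [YYYY-MM-DD] date."""
    tag = "[%s]" % datetime.date.today().isoformat()
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.md")):
        for n, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if tag in line:
                hits.append((str(path), n, line.strip()))
    return hits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;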

&lt;p&gt;&lt;strong&gt;Fix 3: Move the knowledge base out of .claude/&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one was embarrassing. The compile pipeline uses &lt;code&gt;claude -p&lt;/code&gt; to transform daily logs into structured wiki articles. For weeks, it reported success but wrote nothing. Zero wiki articles created.&lt;/p&gt;

&lt;p&gt;The problem: Claude Code treats everything under &lt;code&gt;.claude/&lt;/code&gt; as a sensitive directory. Write operations get silently blocked or require special permissions that background subprocesses don't have.&lt;/p&gt;

&lt;p&gt;Moving &lt;code&gt;knowledge/&lt;/code&gt; from &lt;code&gt;.claude/memory/knowledge/&lt;/code&gt; to the project root fixed it instantly. One directory move.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Before (broken)
.claude/memory/knowledge/concepts/
.claude/memory/knowledge/connections/

# After (works)
knowledge/concepts/
knowledge/connections/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I spent actual weeks debugging recursion guards and subprocess timeouts when the real issue was a file path.&lt;/p&gt;
&lt;h2&gt;
  
  
  What my setup looks like now
&lt;/h2&gt;

&lt;p&gt;The daily workflow is three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a session. Context loads automatically from a Python hook that injects the wiki index and recent daily logs.&lt;/li&gt;
&lt;li&gt;Work normally. Safety hooks checkpoint progress every 50 exchanges and before context compression.&lt;/li&gt;
&lt;li&gt;Type &lt;code&gt;/close-day&lt;/code&gt; when done. Agent synthesizes today's changes into a daily article.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the whole system. Tomorrow's session picks up where today left off.&lt;/p&gt;

&lt;p&gt;After a few weeks, the knowledge base has real substance. Cross-referenced wiki articles about project patterns, architectural decisions, lessons from debugging sessions. All searchable, all structured, all built from actual work rather than raw transcript scraping.&lt;/p&gt;
&lt;h2&gt;
  
  
  The part I keep thinking about
&lt;/h2&gt;

&lt;p&gt;Karpathy published four principles for working with LLM agents. Principle number two is "Simplicity First. Minimum code that solves the problem."&lt;/p&gt;

&lt;p&gt;I took his agent memory concept and built a 300-line automation pipeline on top of it. Background subprocesses, recursion guards, hash comparisons, transcript parsers. And then I replaced all of it with one 30-line script and a manual command.&lt;/p&gt;

&lt;p&gt;The architecture was his. The over-engineering was mine.&lt;/p&gt;

&lt;p&gt;The simplified version is open source. Takes about five minutes to set up:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/awrshift" rel="noopener noreferrer"&gt;
        awrshift
      &lt;/a&gt; / &lt;a href="https://github.com/awrshift/claude-memory-kit" rel="noopener noreferrer"&gt;
        claude-memory-kit
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The OS layer for Claude Code — memory, hooks, knowledge pipeline, experiments. No external deps. v3
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/awrshift/claude-memory-kit/.github/assets/og-banner.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fawrshift%2Fclaude-memory-kit%2FHEAD%2F.github%2Fassets%2Fog-banner.png" alt="Claude Memory Kit — The OS layer for Claude Code"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Memory Kit&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Your Claude agent remembers everything. Across sessions. Across projects. Zero setup.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/awrshift/claude-memory-kit/releases" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6a6ea8517d03d74fe8887531d2a4f6042de39263bf52d0b6f015783e33ecc13c/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f61777273686966742f636c617564652d6d656d6f72792d6b69743f6c6162656c3d76657273696f6e26636f6c6f723d627269676874677265656e" alt="Version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/awrshift/claude-memory-kit/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/501df07c398b24d00e8bf599de2cd3173383e6da911cde7f081f74671d1db858/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c617564655f436f64652d636f6d70617469626c652d363336366631" alt="Claude Code"&gt;&lt;/a&gt;
&lt;a href="https://github.com/awrshift/claude-memory-kit/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f436697c7ed0fe0faee1a6230e6b975eed6d4a88c12e2c55af7401058bb8024a/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f61777273686966742f636c617564652d6d656d6f72792d6b69743f7374796c653d736f6369616c" alt="Stars"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Every new Claude session starts from zero. Yesterday's decisions, last week's research, the bug you fixed three days ago — gone. You waste the first 10 minutes re-explaining what Claude already knew.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Memory Kit fixes this in 3 commands. No API cost. Runs on your existing subscription.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Get Started&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;git clone https://github.com/awrshift/claude-memory-kit.git my-project
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; my-project
claude&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;That's it. Claude sets everything up and asks a few questions (your name, project name, language).&lt;/p&gt;
&lt;div class="markdown-alert markdown-alert-tip"&gt;
&lt;p class="markdown-alert-title"&gt;Tip&lt;/p&gt;
&lt;p&gt;Type &lt;code&gt;/tour&lt;/code&gt; after setup — Claude walks you through the system using your actual files.&lt;/p&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Before and After&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/awrshift/claude-memory-kit/.github/assets/01-before-after.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fawrshift%2Fclaude-memory-kit%2FHEAD%2F.github%2Fassets%2F01-before-after.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;

&lt;th&gt;Without Memory Kit&lt;/th&gt;
&lt;th&gt;With Memory Kit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts from zero. "What project is this?"&lt;/td&gt;
&lt;td&gt;Knows your project, last session, current tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After 10 sessions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing accumulated&lt;/td&gt;
&lt;td&gt;Searchable wiki of decisions, patterns, lessons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multiple projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total chaos&lt;/td&gt;
&lt;td&gt;Each project has its own memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/awrshift/claude-memory-kit" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;Built from 700+ production sessions across 7 projects. If you hit similar issues with agent memory, I'd genuinely like to hear about it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Run 7 Projects in Claude Code Simultaneously. Here's the Memory System That Makes It Possible.</title>
      <dc:creator>Serhii Kravchenko</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:09:55 +0000</pubDate>
      <link>https://dev.to/awrshift/i-run-7-projects-in-claude-code-simultaneously-heres-the-memory-system-that-makes-it-possible-ocg</link>
      <guid>https://dev.to/awrshift/i-run-7-projects-in-claude-code-simultaneously-heres-the-memory-system-that-makes-it-possible-ocg</guid>
      <description>&lt;p&gt;I need to be honest about something upfront.&lt;/p&gt;

&lt;p&gt;I didn't invent persistent memory for Claude Code. By April 2026, there are tools with 46,000+ GitHub stars solving this problem. There are 700,000+ skills indexed on aggregators. The ecosystem is massive.&lt;/p&gt;

&lt;p&gt;What I did is something different. I spent four months running Claude Code across 7 projects simultaneously — a content production platform, a marketing site, a backend API, an open-source toolkit, and three more. Real businesses. Real clients. Real deadlines. And I assembled the memory system that actually survives this kind of workload.&lt;/p&gt;

&lt;p&gt;Claude Memory Kit v3 is not a research project. It's the system I use every single day, and I'm writing this to explain what's in it and where every piece came from.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where The Pieces Come From
&lt;/h2&gt;

&lt;p&gt;I want to be transparent about provenance. This kit is a curated assembly as of April 10, 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Andrej Karpathy&lt;/strong&gt; — the core architecture. His &lt;a href="https://karpathy.ai/" rel="noopener noreferrer"&gt;LLM Knowledge Base&lt;/a&gt; idea: treat your conversations as source code, let an LLM compile them into structured knowledge. Daily logs = source. LLM = compiler. Knowledge articles = executable output. This isn't my idea. It's his. I just built a working implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Cole Medin&lt;/strong&gt; (&lt;a href="https://github.com/coleam00/claude-memory-compiler" rel="noopener noreferrer"&gt;claude-memory-compiler&lt;/a&gt;) — three specific features I ported into v3:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SessionStart injection — the hook that pre-loads your knowledge index into every session&lt;/li&gt;
&lt;li&gt;End-of-day auto-compile — daily logs become wiki articles after 6 PM automatically&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;CLAUDE_INVOKED_BY&lt;/code&gt; recursion guard — prevents infinite loops when the pipeline calls Claude, which triggers hooks, which call Claude again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cole built the original Karpathy-inspired prototype. I ported the three features that mattered most and rewrote them for Python stdlib (no &lt;code&gt;uv&lt;/code&gt;, no &lt;code&gt;agent-sdk&lt;/code&gt;, just &lt;code&gt;subprocess&lt;/code&gt; and &lt;code&gt;os&lt;/code&gt;).&lt;/p&gt;
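
&lt;p&gt;The recursion guard is worth seeing in miniature. A stdlib sketch (the env-var name comes from Cole's code; the exact &lt;code&gt;claude -p&lt;/code&gt; invocation, prompt handling, and timeout here are my assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import subprocess

def summarize(prompt, transcript):
    """Call `claude -p` once; the env marker stops hook-triggered recursion."""
    if os.environ.get("CLAUDE_INVOKED_BY"):
        return None  # we are already a pipeline child: bail out instead of recursing
    env = dict(os.environ, CLAUDE_INVOKED_BY="memory-pipeline")
    result = subprocess.run(
        ["claude", "-p", prompt], input=transcript,
        capture_output=True, text=True, env=env, timeout=300,
    )
    return result.stdout if result.returncode == 0 else None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without the guard, &lt;code&gt;flush.py&lt;/code&gt; calls Claude, Claude's session hooks spawn &lt;code&gt;flush.py&lt;/code&gt;, and so on forever. The env var breaks the loop at the first child.&lt;/p&gt;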

&lt;p&gt;&lt;strong&gt;From Anthropic's own engineers&lt;/strong&gt; — the hook system itself. Claude Code's &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;SessionStart&lt;/code&gt;, &lt;code&gt;SessionEnd&lt;/code&gt; hooks are what make all of this possible. The &lt;code&gt;additionalContext&lt;/code&gt; pattern that lets you inject data at session start? That's Anthropic's API. I just use it aggressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the community&lt;/strong&gt; — the pre-compact blocking pattern (I first saw it in a GitHub issue requesting layered memory), the periodic-save idea (adapted from multiple "I lost everything after compaction" horror stories), and the &lt;code&gt;[[wikilink]]&lt;/code&gt; format for knowledge articles (Obsidian community standard).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From my own production use&lt;/strong&gt; — everything else. The 5-layer context pyramid. The multi-project &lt;code&gt;&amp;lt;!-- PROJECT:name --&amp;gt;&lt;/code&gt; tags. The 200-line MEMORY.md cap (learned the hard way — Claude starts ignoring entries after ~200 lines). The experiment sandbox pattern. The daily log → knowledge compilation pipeline that actually runs unattended. The 50-exchange periodic save interval (not 15, not 100 — 50 is the sweet spot after months of tuning).&lt;/p&gt;

&lt;p&gt;I'm not claiming to be first. I'm claiming this combination works in production, and I can show you why.&lt;/p&gt;
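
&lt;p&gt;The 200-line cap is easy to enforce mechanically. The kit ships a &lt;code&gt;/memory-lint&lt;/code&gt; command; here is a minimal Python sketch of that kind of check (my illustration, not the kit's implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pathlib

def lint_memory(path, cap=200):
    """Flag MEMORY.md once it passes the cap where Claude starts ignoring entries."""
    lines = pathlib.Path(path).read_text(encoding="utf-8").splitlines()
    over = max(0, len(lines) - cap)
    if over:
        return "MEMORY.md is %d lines over the %d-line cap: trim the oldest entries." % (over, cap)
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;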




&lt;h2&gt;
  
  
  My Setup: 7 Projects, One System
&lt;/h2&gt;

&lt;p&gt;Here's what I actually run on this right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7-stage article generation pipeline&lt;/td&gt;
&lt;td&gt;S1-S7 stages, $0.08/article production cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing site&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Landing page + content&lt;/td&gt;
&lt;td&gt;TanStack Start, Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NestJS service for clients&lt;/td&gt;
&lt;td&gt;Code review, PR workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLI tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content quality evaluation engine&lt;/td&gt;
&lt;td&gt;Burstiness, AI detection, SEO scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source toolkit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;This kit + skills + starter templates&lt;/td&gt;
&lt;td&gt;Distribution, community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R&amp;amp;D hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coordination across all projects&lt;/td&gt;
&lt;td&gt;Research, decisions, methodology&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UI concepts for client projects&lt;/td&gt;
&lt;td&gt;Design tokens, visual QA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every single one of these runs on Memory Kit. Each project has its own &lt;code&gt;CLAUDE.md&lt;/code&gt;, its own rules, its own backlog. But they all share the same architecture — the same hooks, the same knowledge pipeline, the same session lifecycle.&lt;/p&gt;

&lt;p&gt;When I switch from the content platform to the marketing site, Claude doesn't ask "what's this project about?" It reads the project's backlog and picks up where I left off. When I find a pattern in one project that applies to another, I write it once in MEMORY.md and it's available everywhere.&lt;/p&gt;
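
&lt;p&gt;The project switch is cheap because the shared context file is split by &lt;code&gt;&amp;lt;!-- PROJECT:name --&amp;gt;&lt;/code&gt; markers, which need nothing more than a regex. A sketch (not the kit's actual parser):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

TAG = re.compile(r"&amp;lt;!--\s*PROJECT:(\S+)\s*--&amp;gt;")

def split_projects(text):
    """Map each PROJECT marker to the lines that follow it, up to the next marker."""
    sections, current = {}, None
    for line in text.splitlines():
        m = TAG.match(line.strip())
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;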

&lt;p&gt;This isn't a demo. This is my Tuesday.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Actually Inside (10 Components)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukf15nxugaxiww2p5rey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukf15nxugaxiww2p5rey.png" alt="Agent Anatomy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;One-line explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Who the agent is, how it behaves, session workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hot Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/memory/MEMORY.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast-access patterns, &amp;lt; 200 lines, loaded every session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deep Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;knowledge/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wiki articles with &lt;code&gt;[[wikilinks]]&lt;/code&gt; — auto-compiled from your work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/hooks/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5 scripts that fire automatically at key moments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rules&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/rules/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your domain conventions — brand voice, client specs, workflow rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Commands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/commands/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/memory-compile&lt;/code&gt;, &lt;code&gt;/memory-lint&lt;/code&gt;, &lt;code&gt;/memory-query&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;projects/X/BACKLOG.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-project task queue with inline decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;context/next-session-prompt.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Pick up exactly here" — with &lt;code&gt;&amp;lt;!-- PROJECT:name --&amp;gt;&lt;/code&gt; sections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;daily/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automatic session transcripts — you never write these&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experiments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;experiments/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sandbox for research before committing to a path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything is plain Markdown. No database. No external services. If anything breaks, &lt;code&gt;git checkout&lt;/code&gt; fixes it. I chose this deliberately — after evaluating SQLite-based solutions, vector embeddings, and graph databases, plain text won because it's the only format that survives everything: git, backups, editor changes, Claude Code updates.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5 Hooks (This Is Where The Magic Lives)
&lt;/h2&gt;

&lt;p&gt;Hooks are the reason this system works without you thinking about it. Each one fires at a specific moment and does one job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesbv2ifurk1bhscseaqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesbv2ifurk1bhscseaqi.png" alt="Session Lifecycle" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hook 1: Session Start (&lt;code&gt;session-start.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Every time you open Claude Code&lt;br&gt;
&lt;strong&gt;What it does:&lt;/strong&gt; Injects your knowledge index, recent daily logs, and the 3 most recent concept articles into the session. 50K character budget, which is roughly 12K tokens: about 1% of Opus 4.6's 1M-token context window.&lt;br&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Claude starts every session already knowing what articles exist in your wiki, what you worked on yesterday, and what your key patterns are. No "read my files" prompt needed.&lt;br&gt;
&lt;strong&gt;Origin:&lt;/strong&gt; Adapted from Cole Medin's claude-memory-compiler, rewritten in Python stdlib.&lt;/p&gt;
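
&lt;p&gt;The interesting part is the output contract. A trimmed-down sketch: the &lt;code&gt;hookSpecificOutput&lt;/code&gt; / &lt;code&gt;additionalContext&lt;/code&gt; JSON shape is Anthropic's documented SessionStart interface, while the file list and budget logic here are illustrative, not the kit's exact code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
"""SessionStart hook sketch: pre-load memory files as additionalContext."""
import json
import pathlib

BUDGET = 50_000  # characters, not tokens
FILES = ["knowledge/INDEX.md", ".claude/memory/MEMORY.md"]  # illustrative paths

def build_context(paths, budget):
    parts = []
    for p in map(pathlib.Path, paths):
        if not budget:
            break  # character budget exhausted
        if not p.exists():
            continue
        text = p.read_text(encoding="utf-8")[:budget]
        budget -= len(text)
        parts.append("## %s\n%s" % (p, text))
    return "\n\n".join(parts)

# Claude Code reads this JSON from stdout and injects additionalContext.
print(json.dumps({"hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": build_context(FILES, BUDGET),
}}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;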

&lt;h3&gt;
  
  
  Hook 2: Periodic Save (&lt;code&gt;periodic-save.sh&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Every 50 exchanges (configurable)&lt;br&gt;
&lt;strong&gt;What it does:&lt;/strong&gt; Blocks Claude and forces it to save: update MEMORY.md with new patterns, update next-session-prompt with current state, update BACKLOG task statuses.&lt;br&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; I lost 3 hours of work in month one because a long session compacted without saving. Never again. 50 is the sweet spot — 15 was too noisy, 100 was too risky.&lt;br&gt;
&lt;strong&gt;Origin:&lt;/strong&gt; My own pain. Tuned over ~60 production sessions.&lt;/p&gt;
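
&lt;p&gt;The kit's hook is a shell script, but the core bookkeeping is just a counter file that survives between hook invocations. Sketched here in Python (the interval and counter path are configurable; this is my illustration, not the kit's code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pathlib

def bump_and_check(counter_file, interval=50):
    """Increment a counter persisted between hook invocations;
    True on every `interval`-th exchange, when the hook should block and demand a save."""
    p = pathlib.Path(counter_file)
    count = int(p.read_text()) + 1 if p.exists() else 1
    p.write_text(str(count))
    return count % interval == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On every 50th hit, the hook blocks and tells Claude to update MEMORY.md, the next-session prompt, and the backlog before continuing.&lt;/p&gt;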

&lt;h3&gt;
  
  
  Hook 3: Pre-Compact Guard (&lt;code&gt;pre-compact.sh&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Before Claude Code compresses the context window&lt;br&gt;
&lt;strong&gt;What it does:&lt;/strong&gt; Checks if MEMORY.md was updated in the last 2 minutes. If not, blocks compaction. Claude must save first.&lt;br&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Compaction is where memory dies. This hook is a seatbelt. The 2-minute window is tight on purpose — it forces the agent to actually touch memory files right before compression, not rely on a save from an hour ago.&lt;br&gt;
&lt;strong&gt;Origin:&lt;/strong&gt; Community request pattern from &lt;a href="https://github.com/anthropics/claude-code/issues/27298" rel="noopener noreferrer"&gt;GitHub issue #27298&lt;/a&gt; about layered memory loss.&lt;/p&gt;
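
&lt;p&gt;The freshness check fits in a dozen lines. The kit's version is a shell script; here is the same logic sketched in Python (the path and 2-minute window come from the description above; exit code 2 is Claude Code's documented way for a hook to block an action and surface stderr):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import sys
import time

MEMORY = ".claude/memory/MEMORY.md"
WINDOW = 120  # seconds

def is_fresh(path, window=WINDOW, now=None):
    """True if the file was modified within the last `window` seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # a missing file counts as stale
    return age &amp;lt;= window

def main():
    if is_fresh(MEMORY):
        return 0  # memory is fresh: let compaction proceed
    print("MEMORY.md is stale. Update memory files before compacting.", file=sys.stderr)
    return 2  # exit code 2 blocks the action; stderr is shown to Claude

# The installed hook ends with: sys.exit(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;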

&lt;h3&gt;
  
  
  Hook 4: Session End (&lt;code&gt;session-end.sh&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Claude Code process terminates&lt;br&gt;
&lt;strong&gt;What it does:&lt;/strong&gt; Extracts the last 100 turns from the transcript and spawns &lt;code&gt;flush.py&lt;/code&gt; in the background. &lt;code&gt;flush.py&lt;/code&gt; uses &lt;code&gt;claude -p&lt;/code&gt; (your existing subscription, zero extra cost) to distill the session into structured Markdown and appends it to &lt;code&gt;daily/YYYY-MM-DD.md&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Every conversation becomes searchable history. You don't do anything — it captures automatically.&lt;br&gt;
&lt;strong&gt;Origin:&lt;/strong&gt; &lt;code&gt;flush.py&lt;/code&gt; logic adapted from Cole Medin's extractor, rewritten for Python subprocess (no agent-sdk dependency).&lt;/p&gt;
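&lt;p&gt;The "last 100 turns" extraction is straightforward if you assume the transcript is JSONL, one JSON object per turn (that format is my assumption here, not a documented contract):&lt;/p&gt;

```python
import json

# Sketch: pull the last N turns out of a JSONL transcript.
def last_turns(jsonl_text, n=100):
    lines = [l for l in jsonl_text.splitlines() if l.strip()]
    return [json.loads(l) for l in lines[-n:]]
```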

&lt;h3&gt;
  
  
  Hook 5: Test Protection (&lt;code&gt;protect-tests.sh&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Any time Claude tries to edit an existing test file&lt;br&gt;
&lt;strong&gt;What it does:&lt;/strong&gt; Blocks the edit. Claude can create new tests, but can't modify existing ones.&lt;br&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; When tests fail, the instinct (for both humans and AI) is to "fix the test." This hook forces fixing the implementation instead. Sounds minor, but it saved me from subtle regressions at least twice.&lt;br&gt;
&lt;strong&gt;Origin:&lt;/strong&gt; My own rule after Claude "fixed" a test by relaxing the assertion.&lt;/p&gt;
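&lt;p&gt;The decision rule fits in a few lines. A sketch, where the path heuristics are illustrative (your project's test naming conventions may differ):&lt;/p&gt;

```python
import os

# Sketch of protect-tests.sh: block edits to existing test files,
# allow creating new ones.
def is_test_file(path):
    name = os.path.basename(path)
    normalized = path.replace("\\", "/")
    return name.startswith("test_") or name.endswith("_test.py") or "/tests/" in normalized

def should_block_edit(path, file_exists):
    # New test files are allowed; modifying existing ones is not.
    return is_test_file(path) and file_exists
```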




&lt;h2&gt;
  
  
  The Knowledge Pipeline (Karpathy's Architecture In Practice)
&lt;/h2&gt;

&lt;p&gt;This is the most powerful part, and it's the piece that comes directly from Karpathy's insight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmg9rsirmz712uy8xtoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmg9rsirmz712uy8xtoy.png" alt="Memory Pipeline" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have a conversation&lt;/strong&gt; → Hook captures it → &lt;code&gt;flush.py&lt;/code&gt; distills it into structured notes → Appended to &lt;code&gt;daily/2026-04-10.md&lt;/code&gt; → After 6 PM, &lt;code&gt;compile.py&lt;/code&gt; transforms daily logs into wiki articles with YAML frontmatter and &lt;code&gt;[[wikilinks]]&lt;/code&gt; → Next session starts with the updated wiki catalog already injected.&lt;/p&gt;
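&lt;p&gt;The "after 6 PM" step is a simple gate so the compile runs at most once per day, at the end of it. A sketch (the 18:00 cutoff comes from the pipeline description; the function name is illustrative):&lt;/p&gt;

```python
from datetime import datetime

# Sketch of the compile gate: daily logs become wiki articles only
# after 18:00 local time, and only once per day.
def should_compile(now, already_compiled_today):
    return now.hour >= 18 and not already_compiled_today
```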

&lt;p&gt;The key insight: you don't organize your knowledge. You have conversations, and the LLM handles the synthesis, cross-referencing, and categorization. After a few weeks, you have a personal wiki that grew entirely from your work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost: $0 extra.&lt;/strong&gt; The pipeline uses &lt;code&gt;claude -p&lt;/code&gt; which runs on your existing Max/Pro subscription. No API key charges. I verified this by running it for a month and checking my billing — zero incremental cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety: recursion guard.&lt;/strong&gt; &lt;code&gt;flush.py&lt;/code&gt; calls &lt;code&gt;claude -p&lt;/code&gt;, which starts a new Claude session, which fires the SessionEnd hook, which would call &lt;code&gt;flush.py&lt;/code&gt; again — infinite loop. The &lt;code&gt;CLAUDE_INVOKED_BY&lt;/code&gt; env var breaks the cycle. Every hook checks it at the top and exits if set. Took me an afternoon to debug the first time it happened. Now it's documented and automatic.&lt;/p&gt;
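&lt;p&gt;The guard itself is the pattern worth stealing. Wrapped in a function here so it can be tested; in the real hooks it is a three-line check at the top of each script:&lt;/p&gt;

```python
import os
import sys

# Recursion guard: bail out immediately when this hook was
# (transitively) spawned by another hook via claude -p.
def should_exit_early(env):
    return bool(env.get("CLAUDE_INVOKED_BY"))

if should_exit_early(os.environ):
    sys.exit(0)

# ... rest of the hook runs only in a top-level session ...
```

&lt;p&gt;The spawning side sets the variable before calling &lt;code&gt;claude -p&lt;/code&gt;, so every hook in the child session sees it and exits.&lt;/p&gt;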




&lt;h2&gt;
  
  
  The Context Pyramid (Why Not Load Everything?)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej5xbr9wr8xyo8gjabiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej5xbr9wr8xyo8gjabiu.png" alt="Context Pyramid" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1: Auto&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLAUDE.md + rules + MEMORY.md + SessionStart injection&lt;/td&gt;
&lt;td&gt;Every session&lt;/td&gt;
&lt;td&gt;~50K chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2: Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;next-session-prompt.md&lt;/td&gt;
&lt;td&gt;First thing the agent reads&lt;/td&gt;
&lt;td&gt;~2-5K chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3: Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BACKLOG.md for current project&lt;/td&gt;
&lt;td&gt;When you start working&lt;/td&gt;
&lt;td&gt;~5-20K chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4: Wiki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge articles&lt;/td&gt;
&lt;td&gt;On-demand (agent knows they exist from L1 index)&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L5: Raw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;daily/ logs&lt;/td&gt;
&lt;td&gt;Never read directly — source material for pipeline&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why the pyramid? Because I tried loading everything. With 7 projects, that's 50+ files. Claude's context filled up in 10 minutes, compacted, and the working context was lost. The pyramid keeps the vast majority of the context window free for actual work, and the small slice it does spend is the right context at the right time.&lt;/p&gt;




&lt;h2&gt;
  
  
  For Marketers (Why I'm Writing This For You)
&lt;/h2&gt;

&lt;p&gt;I've been watching this space closely. By April 2026, there are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;46,000+ stars on claude-mem (the most popular memory tool)&lt;/li&gt;
&lt;li&gt;143,000+ stars on Superpowers (the most popular framework)&lt;/li&gt;
&lt;li&gt;700,000+ skills indexed on SkillsMP&lt;/li&gt;
&lt;li&gt;107,000+ skills on agentskill.sh, including 25,000+ marketing-specific ones&lt;/li&gt;
&lt;li&gt;Free courses at cc4.marketing and ccforeveryone.com&lt;/li&gt;
&lt;li&gt;An official Anthropic Marketing Plugin&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ecosystem is enormous and growing. So why am I writing this?&lt;/p&gt;

&lt;p&gt;Because I noticed a gap. Most of these tools are built by developers, for developers. The marketing skills exist, but nobody is showing &lt;strong&gt;how to connect them into a system that remembers your clients, your brand guidelines, and your content strategy across sessions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A skill that writes SEO content is great. But if it forgets your client's brand voice every time you restart Claude Code, it's just a fancy prompt template.&lt;/p&gt;

&lt;p&gt;Memory Kit isn't a skill. It's the &lt;strong&gt;operating system layer&lt;/strong&gt; that makes your skills, your rules, and your context persist. When you install a marketing skill, Memory Kit ensures Claude remembers how to use it the way &lt;em&gt;you&lt;/em&gt; use it — with your client data, your conventions, your history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical example:&lt;/strong&gt; You install an SEO audit skill. First run, you explain your client's niche, target keywords, and competitor URLs. Without Memory Kit, next session you explain it again. With Memory Kit, that context lives in &lt;code&gt;rules/client-name.md&lt;/code&gt; and &lt;code&gt;knowledge/concepts/client-seo-strategy.md&lt;/code&gt;. Every future audit starts with full context.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Do I Need to Know How to Code?"
&lt;/h2&gt;

&lt;p&gt;No. And I mean that literally.&lt;/p&gt;

&lt;p&gt;After the 3-command install, Claude asks you 5 questions in plain language: project name, your name, language preference, project description, starting fresh or importing existing work. Then it configures everything.&lt;/p&gt;

&lt;p&gt;From that point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Write three emails for the Acme campaign" — works&lt;/li&gt;
&lt;li&gt;"What do we know about our SEO gaps?" — Claude searches its memory&lt;/li&gt;
&lt;li&gt;"Save what we discussed" — Claude updates memory and context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/tour&lt;/code&gt; — Claude walks you through every file, explains what each one does&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The files are plain Markdown. If you've used Notion, you can read these. But you don't have to — Claude manages them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Comparison
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend alternatives don't exist. Here's where things stand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Memory Kit&lt;/th&gt;
&lt;th&gt;claude-mem (46K stars)&lt;/th&gt;
&lt;th&gt;Cog&lt;/th&gt;
&lt;th&gt;Built-in CLAUDE.md&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 commands, 5 min&lt;/td&gt;
&lt;td&gt;Plugin install&lt;/td&gt;
&lt;td&gt;Clone + configure&lt;/td&gt;
&lt;td&gt;Already there&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-learns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (hooks + compile)&lt;/td&gt;
&lt;td&gt;Yes (SQLite + embeddings)&lt;/td&gt;
&lt;td&gt;Manual conventions&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero (Python stdlib)&lt;/td&gt;
&lt;td&gt;TypeScript + SQLite&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (PROJECT tags)&lt;/td&gt;
&lt;td&gt;Single project&lt;/td&gt;
&lt;td&gt;Single project&lt;/td&gt;
&lt;td&gt;Single file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wiki with wikilinks&lt;/td&gt;
&lt;td&gt;Compressed vectors&lt;/td&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;Flat file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Karpathy-style compile&lt;/td&gt;
&lt;td&gt;AI compression&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;46,000+&lt;/td&gt;
&lt;td&gt;~2,000&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Yes, 8 stars. I'm not hiding it. claude-mem has roughly 5,750x more stars and a beautiful marketing site. If you want the most popular option, go there.&lt;/p&gt;

&lt;p&gt;Memory Kit's edge: &lt;strong&gt;multi-project structure&lt;/strong&gt; (PROJECT tags, per-project backlogs, shared memory), &lt;strong&gt;zero dependencies&lt;/strong&gt; (no npm, no SQLite, no TypeScript runtime), and the &lt;strong&gt;Karpathy-style knowledge compilation pipeline&lt;/strong&gt; that turns raw conversations into structured, cross-referenced wiki articles.&lt;/p&gt;

&lt;p&gt;If you work on one project, claude-mem is probably simpler. If you juggle multiple clients, campaigns, or products — that's where Memory Kit was built and tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  You need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Claude Pro ($20/mo) or Max subscription&lt;/li&gt;
&lt;li&gt;A terminal (Terminal on Mac, WSL2 on Windows)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/awrshift/claude-memory-kit.git my-project
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  First 10 minutes
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Answer Claude's 5 setup questions&lt;/strong&gt; — name, project, language, description, fresh/existing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type &lt;code&gt;/tour&lt;/code&gt;&lt;/strong&gt; — Claude walks through every file with interactive explanations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell Claude about your first client&lt;/strong&gt; — "Create a rule for [client name] with this brand voice: [paste guidelines]"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start working normally&lt;/strong&gt; — the hooks handle everything else&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After a few sessions, check &lt;code&gt;daily/&lt;/code&gt; — you'll see conversation logs appearing. After a few days, check &lt;code&gt;knowledge/&lt;/code&gt; — structured articles growing from your work.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;A few things I didn't expect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 200-line limit is real.&lt;/strong&gt; MEMORY.md is auto-loaded every session, and Anthropic truncates it after ~200 lines. I hit this wall in month two with 180 entries. Now I aggressively move detailed patterns into wiki articles and keep MEMORY.md as an index. The &lt;code&gt;[YYYY-MM]&lt;/code&gt; date tag on every entry helps — I can prune old patterns that no longer apply.&lt;/p&gt;
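&lt;p&gt;Pruning by date tag is mechanical once every entry carries one. A sketch, assuming one entry per line with an inline &lt;code&gt;[YYYY-MM]&lt;/code&gt; tag (the exact entry format is my assumption):&lt;/p&gt;

```python
import re

# Sketch: drop MEMORY.md entries older than a cutoff month and keep
# the file under the ~200-line auto-load limit.
TAG = re.compile(r"\[(\d{4})-(\d{2})\]")

def prune_old_entries(lines, cutoff="2026-01", limit=200):
    kept = []
    for line in lines:
        m = TAG.search(line)
        if m and f"{m.group(1)}-{m.group(2)}" < cutoff:
            continue  # entry predates the cutoff month: prune it
        kept.append(line)  # untagged lines (headers etc.) survive
    return kept[:limit]
```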

&lt;p&gt;&lt;strong&gt;50 exchanges is the save interval sweet spot.&lt;/strong&gt; At 15 (v2 default), Claude saved too often — broke flow. At 100+, I lost significant context after compaction. 50 exchanges is roughly 30-45 minutes of focused work. Enough to accumulate meaningful patterns, not so long that you risk losing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plain text beats everything.&lt;/strong&gt; I evaluated SQLite, vector embeddings, and a Neo4j graph. Plain Markdown won because: git tracks changes, any editor reads it, Claude Code's Read/Write tools handle it natively, and it survives every upgrade. When claude-mem upgrades their schema, you migrate. When I upgrade Memory Kit, you &lt;code&gt;git pull&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The recursion bug was terrifying.&lt;/strong&gt; First time &lt;code&gt;flush.py&lt;/code&gt; triggered an infinite loop — &lt;code&gt;claude -p&lt;/code&gt; spawning &lt;code&gt;claude -p&lt;/code&gt; spawning &lt;code&gt;claude -p&lt;/code&gt; — I had 47 Claude processes running before I killed them. The &lt;code&gt;CLAUDE_INVOKED_BY&lt;/code&gt; guard was born that evening. It's three lines of code that prevent infinite recursion across the entire pipeline. If you build anything that spawns &lt;code&gt;claude -p&lt;/code&gt; from hooks, steal this pattern.&lt;/p&gt;




&lt;p&gt;The repo is at &lt;a href="https://github.com/awrshift/claude-memory-kit" rel="noopener noreferrer"&gt;github.com/awrshift/claude-memory-kit&lt;/a&gt;. MIT license. I use it every day. If you try it, let me know what works and what doesn't — the best improvements to this system came from actual production use, not theory.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Third article in the "Claude Code for the Rest of Us" series. I'm &lt;a href="https://github.com/pmserhii" rel="noopener noreferrer"&gt;@pmserhii&lt;/a&gt; — I build open-source AI tools at &lt;a href="https://github.com/awrshift" rel="noopener noreferrer"&gt;awrshift&lt;/a&gt; and run production content pipelines. This is what I actually use.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why I Make Claude and Gemini Argue: Building an Adversarial Agentic Workflow (Open-Source Skill)</title>
      <dc:creator>Serhii Kravchenko</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:45:10 +0000</pubDate>
      <link>https://dev.to/awrshift/why-i-make-claude-and-gemini-argue-building-an-adversarial-agentic-workflow-open-source-skill-2bgl</link>
      <guid>https://dev.to/awrshift/why-i-make-claude-and-gemini-argue-building-an-adversarial-agentic-workflow-open-source-skill-2bgl</guid>
      <description>&lt;p&gt;In traditional engineering, you'd never let a developer merge code without a peer review.&lt;/p&gt;

&lt;p&gt;So why are we letting AI grade its own homework?&lt;/p&gt;

&lt;p&gt;I've been building with Claude Code for 750+ sessions across multiple projects — a content pipeline, a marketing site, a design system, decision frameworks. Somewhere around session 200, I noticed a pattern: &lt;strong&gt;Claude is brilliant, but it has consistent blind spots.&lt;/strong&gt; It favors certain architectures. It misses edge cases in its own prompts. It quietly accepts assumptions that a different perspective would challenge.&lt;/p&gt;

&lt;p&gt;So I did something unconventional: I gave Claude a sparring partner.&lt;/p&gt;

&lt;p&gt;I built an open-source skill called &lt;strong&gt;Brainstorm&lt;/strong&gt; that runs a structured 3-round adversarial dialogue between Claude Code and Google's Gemini. Not a simple "ask two models the same question" approach — a real debate where each model challenges the other's reasoning, and they converge on a single actionable recommendation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the repo:&lt;/strong&gt; &lt;a href="https://github.com/awrshift/claude-starter-kit" rel="noopener noreferrer"&gt;Claude Starter Kit&lt;/a&gt; — the Brainstorm skill is included alongside memory, hooks, and three other Claude Code skills. MIT license, works out of the box.&lt;/p&gt;

&lt;p&gt;Let me show you what happened when I put this to work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Single-Model Agentic Workflows Hit a Ceiling
&lt;/h2&gt;

&lt;p&gt;If you use Claude Code daily, you know how productive it is. It reads your codebase, makes changes, runs tests, iterates. For straightforward tasks, it's incredible.&lt;/p&gt;

&lt;p&gt;But for &lt;strong&gt;decisions&lt;/strong&gt; — architecture choices, design system approaches, prompt engineering, evaluation criteria — a single model creates a feedback loop. Claude designs a solution, Claude evaluates it, Claude declares it good. There's no external challenge.&lt;/p&gt;

&lt;p&gt;You might ask: why not just prompt Claude to red-team its own output? Or use a cheaper Claude model to draft and a smarter one to review? We tried both. The problem is fundamental — same provider, same training corpus, same architectural biases. Claude challenging Claude is like asking someone to proofread their own essay. They'll catch typos but miss the structural issues. A &lt;strong&gt;different model family&lt;/strong&gt; with different training data and different instincts is what breaks the loop.&lt;/p&gt;

&lt;p&gt;I hit this wall three separate times before I built a systematic fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 1: Design system architecture.&lt;/strong&gt; Claude recommended using Google's Stitch tool with post-processing to fix design token adherence. Sounded reasonable. Spent two days implementing it. Token adherence: 35%. The approach was fundamentally flawed, and Claude couldn't see it because it designed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 2: Content pipeline prompts.&lt;/strong&gt; Claude wrote evaluation prompts for our 7-stage content pipeline. The prompts looked great — well-structured, detailed, comprehensive. But when we actually measured output quality, the scores were mediocre. The prompts had loopholes that Claude couldn't identify in its own work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 3: Quality metrics.&lt;/strong&gt; Claude designed metrics to evaluate content quality, then evaluated content using those same metrics. Circular validation. The scores looked good on paper but didn't reflect real quality improvements.&lt;/p&gt;

&lt;p&gt;Every one of these failures had the same root cause: &lt;strong&gt;no adversarial pressure.&lt;/strong&gt; The model was reviewing its own work with its own biases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter the Brainstorm Skill: Claude vs Gemini in the Terminal
&lt;/h2&gt;

&lt;p&gt;The Brainstorm skill runs a 3-round structured debate between Claude and Gemini. It's not random back-and-forth — each round has a specific purpose:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 1 — Diverge.&lt;/strong&gt; Both models propose different approaches to the problem. Claude brings codebase context (it can read your files). Gemini brings a fresh perspective from a completely different model family with different training biases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 2 — Deepen.&lt;/strong&gt; Each model challenges the other's proposal. "What happens when input is empty?" "What about the edge case where the user has 12 languages?" "Your approach assumes X, but what if Y?" This is where the real value emerges — the challenges neither model would generate reviewing its own work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 3 — Converge.&lt;/strong&gt; After two rounds of productive conflict, the models synthesize a single recommendation with clear reasoning. You get one actionable path forward, not two competing opinions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Gemini gets context:&lt;/strong&gt; Claude orchestrates the entire flow. It reads your local files, summarizes the relevant context, and passes it to Gemini via the Google GenAI API along with the debate prompt. Gemini never touches your filesystem directly — Claude acts as the bridge, deciding what context is relevant to share. Your full codebase stays local; only the summaries Claude chooses to share are sent, and that is enough for Gemini to give meaningful critique.&lt;/p&gt;
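&lt;p&gt;The bridge boils down to building a bounded payload before the API call. A sketch of that step only (payload shape, cap, and names are illustrative; the real skill drives the &lt;code&gt;google-genai&lt;/code&gt; client with its own prompt templates):&lt;/p&gt;

```python
# Sketch: Claude summarizes local files into a bounded context block
# and sends it to Gemini together with the debate question.
MAX_CONTEXT_CHARS = 20_000  # illustrative cap, not the skill's real value

def build_gemini_payload(question, file_summaries, cap=MAX_CONTEXT_CHARS):
    """file_summaries: list of (path, summary) pairs chosen by Claude."""
    context = "\n\n".join(f"### {path}\n{summary}" for path, summary in file_summaries)
    return (
        "Context from the local project:\n"
        f"{context[:cap]}\n\n"
        f"Debate question: {question}"
    )
```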

&lt;p&gt;The architecture is deliberate. Gemini uses a two-layer approach: Flash-Lite with Google Search grounding gathers real-world facts first (the "ground truth" phase), then Pro reasons on verified data. A mandatory fact-check phase at the end catches any claims that slipped through. A typical brainstorm takes about 40-60 seconds and costs roughly $0.02-0.05 in API calls — overkill for fixing a typo, but invaluable for architecture decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (one command)&lt;/span&gt;
git clone https://github.com/awrshift/claude-starter-kit.git my-project
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude

&lt;span class="c"&gt;# Or add just the skill to an existing project&lt;/span&gt;
git clone https://github.com/awrshift/skill-brainstorm.git .claude/skills/brainstorm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then just say "brainstorm" in Claude Code, describe your problem, and watch the debate unfold.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case Study 1: How an AI Code Review Loop Replaced Our Designer
&lt;/h2&gt;

&lt;p&gt;This is the one that convinced me adversarial agentic workflows are the future of development.&lt;/p&gt;

&lt;p&gt;We were building a marketing site called Avoid Content. The question: how should Claude generate UI components that precisely follow our design tokens (colors, typography, spacing)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's position (Round 1):&lt;/strong&gt; Use Google Stitch to generate screens, then post-process the output to replace colors and fonts with our design tokens. Reasonable — Stitch is a powerful UI generation tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini's challenge (Round 2):&lt;/strong&gt; "Post-processing is fragile. What happens when Stitch generates a gradient that mixes two non-token colors? What about hover states? You'll spend more time fixing edge cases than you save." Gemini argued for generating code directly from design tokens — skip Stitch entirely, use a frontend-design approach where tokens are injected into the generation prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The convergence (Round 3):&lt;/strong&gt; Test both approaches, measure token adherence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stitch + post-processing: &lt;strong&gt;35% token adherence&lt;/strong&gt; (18 color references in benchmark components, only 6 matched)&lt;/li&gt;
&lt;li&gt;Direct generation from tokens: &lt;strong&gt;100% token adherence on benchmark components&lt;/strong&gt; (18/18 exact hex matches — the strict token schema forced compliance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That brainstorm literally replaced our entire design workflow. And it went further — we built a &lt;strong&gt;visual QA loop&lt;/strong&gt; on top of it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude generates a component using design tokens&lt;/li&gt;
&lt;li&gt;Playwright takes a screenshot&lt;/li&gt;
&lt;li&gt;Gemini reviews the screenshot visually against a reference design&lt;/li&gt;
&lt;li&gt;Claude fixes issues Gemini identified&lt;/li&gt;
&lt;li&gt;Repeat (max 2 iterations)&lt;/li&gt;
&lt;/ol&gt;
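&lt;p&gt;The control flow of those five steps can be sketched as a small loop. The &lt;code&gt;generate&lt;/code&gt;, &lt;code&gt;screenshot&lt;/code&gt;, and &lt;code&gt;review&lt;/code&gt; callables stand in for the real Claude, Playwright, and Gemini calls:&lt;/p&gt;

```python
# Sketch of the visual QA loop: generate, screenshot, review, fix,
# with the iteration cap from the workflow above.
def visual_qa_loop(generate, screenshot, review, max_iterations=2):
    component = generate(feedback=None)
    for _ in range(max_iterations):
        issues = review(screenshot(component))
        if not issues:
            break  # both models agree the output is solid
        component = generate(feedback=issues)
    return component
```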

&lt;p&gt;Typography scores went from 5/10 to 8/10 in a single iteration. Spacing, visual hierarchy, overall polish — all improved measurably because a different model family was doing the AI code review.&lt;/p&gt;

&lt;p&gt;We effectively replaced manual design reviews with an automated Claude + Gemini loop. Not "AI-assisted design" — AI-driven design with AI-driven quality assurance. The entire Avoid Content site was built this way: Claude prototyping, Gemini reviewing screenshots, iterating until both models agreed the output was solid.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case Study 2: Gemini Gates in a 7-Stage Content Pipeline
&lt;/h2&gt;

&lt;p&gt;Our content generation platform runs articles through seven stages: Strategy, Outline, Research, Generate, Verify, Optimize, Finalize. Each stage has specific quality metrics.&lt;/p&gt;

&lt;p&gt;The breakthrough wasn't using Gemini to generate content. It was using Gemini as a &lt;strong&gt;gate&lt;/strong&gt; — a checkpoint that must approve output before it moves to the next stage.&lt;/p&gt;

&lt;p&gt;At the prompt design phase for Stage 4 (article generation), we ran the prompt through Gemini for stress-testing:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Identify how an LLM could misinterpret this prompt. Find loopholes, missing constraints, ambiguous rules."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gemini found three critical loopholes Claude missed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The prompt said "avoid repetitive sentence starters" but didn't define what counts as repetitive (per-section vs. article-level)&lt;/li&gt;
&lt;li&gt;Temperature 1.0 + negative instructions ("don't use X") triggered the pink elephant effect — the model used X more, not less&lt;/li&gt;
&lt;li&gt;Word count targets lacked adaptive coefficients, causing +25% overshoot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After fixing these based on Gemini's critique, article quality jumped measurably. And here's the finding we've now confirmed three separate times:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt quality &amp;gt; model quality.&lt;/strong&gt; A basic prompt on Gemini Pro performed identically to Flash (50% quality score). The same models with a stress-tested prompt hit 75-80%. The bottleneck was never the model — it was the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the single most important lesson from running two AI models together. You don't need a more expensive model. You need a different model family to find the holes in your prompts.&lt;/p&gt;

&lt;p&gt;The same pattern held for our AWRSHIFT decision framework — a structured tool for non-trivial choices (which architecture? build vs. buy?). Through brainstorm sessions, Gemini pushed back on Claude's overcomplicated 5-mode design and the system converged on a single adaptive flow. Simpler for users, more flexible for the system. That framework is also open-source: &lt;a href="https://github.com/awrshift/skill-awrshift" rel="noopener noreferrer"&gt;skill-awrshift&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Mini Claude Code Tutorial: The Technical Setup
&lt;/h2&gt;

&lt;p&gt;Getting this running takes about five minutes. Here's what you need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; installed (CLI or Desktop)&lt;/li&gt;
&lt;li&gt;A free &lt;a href="https://aistudio.google.com/apikey" rel="noopener noreferrer"&gt;Google API key&lt;/a&gt; for Gemini&lt;/li&gt;
&lt;li&gt;Python 3.10+ with &lt;code&gt;pip install google-genai&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Full Starter Kit (recommended)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/awrshift/claude-starter-kit.git my-project
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs the setup automatically — asks your name, project description, language preference, and configures everything. You get four Claude Code skills out of the box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brainstorm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3-round Claude x Gemini adversarial dialogue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quick second opinions, prompt stress-tests, visual reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWRSHIFT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured decision framework with Gemini gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill Creator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build and test your own custom skills&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus a persistent memory system, session hooks, multi-project journals, and experiments tracking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Just the Brainstorm skill&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to existing Claude Code project&lt;/span&gt;
git clone https://github.com/awrshift/skill-brainstorm.git .claude/skills/brainstorm

&lt;span class="c"&gt;# Set up Gemini&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_API_KEY=your-key-here"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
pip &lt;span class="nb"&gt;install &lt;/span&gt;google-genai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Claude Code, just say "brainstorm [your question]" and it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Gemini skill only (for quick second opinions)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/awrshift/skill-gemini.git .claude/skills/gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it for one-off checks: "ask Gemini if this architecture makes sense" or "get a second opinion on this prompt."&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Brainstorm vs. Second Opinion
&lt;/h2&gt;

&lt;p&gt;Not every decision needs a 3-round debate. Here's how we think about it after 750+ sessions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One clear path, need validation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini second-opinion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quick (~5s), single-round, catches obvious issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple viable approaches&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brainstorm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full 3-round debate (~45s), converges on one answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt stress-testing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini second-opinion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find loopholes before deploying prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brainstorm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Different model = different design instincts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual design review&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini --image&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multimodal review of screenshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fact verification&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini second-opinion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-model validation of claims&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rule is simple: &lt;strong&gt;if there's one path and you need a sanity check, use Gemini directly. If there are multiple paths and you need to converge, use Brainstorm.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Learned: Five Rules for Multi-Model Agentic Workflows
&lt;/h2&gt;

&lt;p&gt;After building production systems with Claude + Gemini, these patterns held up consistently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prompt quality always beats model upgrades.&lt;/strong&gt;&lt;br&gt;
We proved this three times across different domains. A well-crafted prompt on a cheaper model outperforms a lazy prompt on an expensive one. Use Gemini to stress-test your prompts before optimizing your model tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gemini's critique is an input, never the decision.&lt;/strong&gt;&lt;br&gt;
44% of Gemini's insights were genuinely unique — things Claude would never catch on its own. But Gemini lacks your codebase context, your prior decisions, your constraints. Always evaluate its recommendations critically before acting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Different model families catch different blind spots.&lt;/strong&gt;&lt;br&gt;
This is the whole thesis. Claude and Gemini have different training data, different architectures, different biases. When they disagree, that's where the most valuable insights hide. When they agree, you can be more confident the answer is solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Fact-check after every brainstorm.&lt;/strong&gt;&lt;br&gt;
We found that 2 out of 6 brainstorm decisions were invalidated when we checked the claims against live web data. The mandatory fact-check phase (built into Brainstorm v2.1) catches these before they become expensive mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The visual QA loop is underrated.&lt;/strong&gt;&lt;br&gt;
Most developers only use text-to-text generation. Adding Playwright screenshots + Gemini visual review creates an AI code review feedback loop that catches UI issues no text-based review ever could. Typography, spacing, color contrast — Gemini sees what Claude can only describe.&lt;/p&gt;




&lt;h2&gt;
  
  
  Exploring the Claude Code Skills Ecosystem
&lt;/h2&gt;

&lt;p&gt;If you're new to Claude Code skills, they're essentially reusable capabilities you add to your agent. A skill is a folder with a &lt;code&gt;SKILL.md&lt;/code&gt; file that tells Claude when and how to use it.&lt;/p&gt;
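&lt;p&gt;As an illustrative sketch (the skill name, frontmatter fields, and commands here are invented for the example; the official format may differ), a minimal &lt;code&gt;SKILL.md&lt;/code&gt; could look like:&lt;/p&gt;

```markdown
---
name: lint-check
description: Runs the project linter and summarizes failures. Use when the user asks to lint or check code style.
---

# Lint Check

1. Run the project's lint command (e.g. `npm run lint`).
2. Group failures by file and rule.
3. Propose the smallest fix for each group.
```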

&lt;p&gt;The ecosystem is growing fast — skills can be shared through the Claude Code plugins marketplace, and the Starter Kit includes a Skill Creator that lets you build your own skills and test them with an eval framework.&lt;/p&gt;

&lt;p&gt;Some ideas for custom skills you could build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A skill that runs your test suite and interprets failures&lt;/li&gt;
&lt;li&gt;A skill that checks your PR against your team's code style guide&lt;/li&gt;
&lt;li&gt;A skill that queries your production logs when debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/awrshift" rel="noopener noreferrer"&gt;Claude skills on GitHub&lt;/a&gt; for more examples and inspiration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Everything mentioned in this article is open-source and free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/awrshift/claude-starter-kit" rel="noopener noreferrer"&gt;Claude Starter Kit&lt;/a&gt;&lt;/strong&gt; — Full setup with memory, skills, hooks, and multi-project support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/awrshift/skill-brainstorm" rel="noopener noreferrer"&gt;skill-brainstorm&lt;/a&gt;&lt;/strong&gt; — Standalone 3-round Claude x Gemini adversarial dialogue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/awrshift/skill-gemini" rel="noopener noreferrer"&gt;skill-gemini&lt;/a&gt;&lt;/strong&gt; — Quick second opinions and visual reviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/awrshift/skill-awrshift" rel="noopener noreferrer"&gt;skill-awrshift&lt;/a&gt;&lt;/strong&gt; — Structured decision framework with Gemini gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup takes five minutes. The first brainstorm session will probably change how you think about AI pair programming.&lt;/p&gt;

&lt;p&gt;Because the real power isn't in having a smarter AI. It's in having two AIs that think differently — and making them argue until the best answer wins.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/pmserhii" rel="noopener noreferrer"&gt;Serhii Kravchenko&lt;/a&gt; — based on 750+ sessions building AI content pipelines, multi-agent systems, and design automation with Claude Code.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We're working toward getting Brainstorm into the official Claude Code plugins directory. If this workflow saved you time, a star on the &lt;a href="https://github.com/awrshift/claude-starter-kit" rel="noopener noreferrer"&gt;Starter Kit repo&lt;/a&gt; helps us get there faster.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Built a Memory System for Claude Code and Open-Sourced It</title>
      <dc:creator>Serhii Kravchenko</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:50:34 +0000</pubDate>
      <link>https://dev.to/awrshift/how-i-built-a-memory-system-for-claude-code-and-open-sourced-it-3m31</link>
      <guid>https://dev.to/awrshift/how-i-built-a-memory-system-for-claude-code-and-open-sourced-it-3m31</guid>
      <description>&lt;p&gt;You open Claude Code. You work for an hour — refactoring, debugging, building something real. You close the terminal. Next morning you type &lt;code&gt;claude&lt;/code&gt; and... it has no idea what happened yesterday.&lt;/p&gt;

&lt;p&gt;I've been there about a thousand times. Literally.&lt;/p&gt;

&lt;p&gt;After 1000+ sessions building content pipelines, multi-agent systems, and GEO optimization tools with Claude Code, I got fed up with the context amnesia. So I built a system that fixes it. And today I'm open-sourcing everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/awrshift/claude-starter-kit" rel="noopener noreferrer"&gt;awrshift/claude-starter-kit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Claude Code is powerful. It reads your codebase, runs tests, writes code that actually works. But it has one brutal limitation — &lt;strong&gt;no persistent memory between sessions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every time you start a new session, you're back to square one. The agent doesn't remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What you worked on yesterday&lt;/li&gt;
&lt;li&gt;Which architectural decisions you made&lt;/li&gt;
&lt;li&gt;What patterns keep causing bugs&lt;/li&gt;
&lt;li&gt;Where you left off&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you burn 10-15 minutes every session just re-loading context. Multiply that by 5 sessions a day, 5 days a week — that's more than 15 hours a month wasted on "hey Claude, remember when we..."&lt;/p&gt;

&lt;p&gt;Most devs I've talked to handle this with a fat &lt;code&gt;CLAUDE.md&lt;/code&gt; file. That works until it doesn't. Once your project grows past 3 weeks of work, a single instruction file can't hold everything you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built Instead
&lt;/h2&gt;

&lt;p&gt;The starter kit gives Claude Code three things it doesn't have out of the box:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Persistent memory&lt;/strong&gt; — a &lt;code&gt;.claude/memory/&lt;/code&gt; directory with three files that survive between sessions. &lt;code&gt;MEMORY.md&lt;/code&gt; stores long-term patterns ("this API always returns pagination headers"). &lt;code&gt;CONTEXT.md&lt;/code&gt; is a quick-orientation card ("currently working on auth module, tests are failing"). &lt;code&gt;snapshots/&lt;/code&gt; keeps session backups so nothing gets lost when the conversation compresses.&lt;/p&gt;
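&lt;p&gt;The layout described above can be sketched in a few shell commands (directory and file names follow the article; the kit's actual scaffolding may differ):&lt;/p&gt;

```shell
# Sketch of the persistent-memory layout: two files plus a snapshots directory.
mkdir -p .claude/memory/snapshots
printf '# Long-term patterns\n' > .claude/memory/MEMORY.md
printf '# Quick-orientation card\n' > .claude/memory/CONTEXT.md
ls .claude/memory
```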

&lt;p&gt;&lt;strong&gt;2. Session continuity&lt;/strong&gt; — a &lt;code&gt;next-session-prompt.md&lt;/code&gt; file that acts as a cross-project hub. Each project gets its own tagged section, so multiple Claude Code windows can work on different projects in parallel without stepping on each other.&lt;/p&gt;
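&lt;p&gt;For example, a &lt;code&gt;next-session-prompt.md&lt;/code&gt; serving two projects might look like this (the tag syntax here is illustrative, not necessarily the kit's exact format):&lt;/p&gt;

```markdown
<!-- PROJECT: blog-pipeline -->
Next: finish the RSS parser; two tests are still red.
<!-- /PROJECT: blog-pipeline -->

<!-- PROJECT: design-system -->
Next: regenerate color tokens after the palette change.
<!-- /PROJECT: design-system -->
```

&lt;p&gt;Each Claude Code window edits only its own tagged section, which is what keeps parallel projects from clobbering each other.&lt;/p&gt;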

&lt;p&gt;&lt;strong&gt;3. Hooks that protect you&lt;/strong&gt; — a &lt;code&gt;session-start.sh&lt;/code&gt; that shows memory summary + git status when you open a session. And a &lt;code&gt;pre-compact.sh&lt;/code&gt; that fires before Claude compresses your conversation — it forces the agent to save context before anything gets lost.&lt;/p&gt;
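&lt;p&gt;A session-start hook in that spirit could be as simple as the following (an illustrative sketch, not the kit's actual script):&lt;/p&gt;

```shell
# Write an illustrative session-start hook.
mkdir -p .claude/hooks
cat > .claude/hooks/session-start.sh <<'EOF'
#!/usr/bin/env bash
# Show the quick-orientation card, then git status, when a session opens.
[ -f .claude/memory/CONTEXT.md ] && { echo "=== CONTEXT ==="; cat .claude/memory/CONTEXT.md; }
echo "=== GIT STATUS ==="
git status --short 2>/dev/null || echo "(not a git repository)"
EOF
chmod +x .claude/hooks/session-start.sh
```

&lt;p&gt;Run it manually with &lt;code&gt;bash .claude/hooks/session-start.sh&lt;/code&gt; to preview what the agent would see at startup.&lt;/p&gt;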

&lt;h2&gt;
  
  
  The Four-Layer Context System
&lt;/h2&gt;

&lt;p&gt;Not everything needs to load every time. The kit uses a pyramid:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27k0t8wun6mmz8ihmj55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27k0t8wun6mmz8ihmj55.png" alt="Context Layers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L1 — Auto (every session):&lt;/strong&gt; &lt;code&gt;CLAUDE.md&lt;/code&gt; + domain rules + &lt;code&gt;MEMORY.md&lt;/code&gt;. This is the agent's identity and accumulated knowledge. Loads automatically, always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 — Start (session start):&lt;/strong&gt; &lt;code&gt;next-session-prompt.md&lt;/code&gt; + &lt;code&gt;CONTEXT.md&lt;/code&gt;. Orientation layer — what project am I in? What's next? What happened last time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3 — Project (on demand):&lt;/strong&gt; &lt;code&gt;projects/X/JOURNAL.md&lt;/code&gt;. Each project has one file for tasks, decisions, and status. The agent reads it when you start working on that project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L4 — Reference (when needed):&lt;/strong&gt; Docs, snapshots, anything deep. Pulled only when relevant — keeps token usage low.&lt;/p&gt;

&lt;p&gt;The pyramid means Claude always knows who it is (L1), quickly orients itself (L2), and dives deep only when needed (L3-L4). No wasted context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Skills Included
&lt;/h2&gt;

&lt;p&gt;The kit ships with three global skills that install to &lt;code&gt;~/.claude/skills/&lt;/code&gt; on first run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; — get second opinions from Google's Gemini models. Different model family = different blind spots. I use this for prompt stress-testing and hypothesis falsification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brainstorm&lt;/strong&gt; — a 3-round adversarial dialogue between Claude and Gemini. Round 1: diverge. Round 2: challenge weak points. Round 3: converge on one action. For architecture decisions it's worth every token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design&lt;/strong&gt; — full design system lifecycle from URL to production CSS. Extract colors, compute palettes, generate tokens, audit HTML, run visual QA loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; claude-starter-kit my-project
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. On first launch, Claude reads &lt;code&gt;CLAUDE.md&lt;/code&gt;, sees the setup instructions, and configures everything automatically — installs skills, sets up memory, initializes git, cleans scaffolding.&lt;/p&gt;

&lt;p&gt;No manual configuration. You can start working immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes After a Week
&lt;/h2&gt;

&lt;p&gt;The real value isn't day one. It's day seven.&lt;/p&gt;

&lt;p&gt;By then, &lt;code&gt;MEMORY.md&lt;/code&gt; has 15-20 verified patterns from your work. Things like "this ORM silently drops null values" or "user prefers 2-space indentation". The agent stops asking and starts knowing.&lt;/p&gt;
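&lt;p&gt;An illustrative &lt;code&gt;MEMORY.md&lt;/code&gt; after a week might read (entries invented for the example):&lt;/p&gt;

```markdown
# MEMORY

## Verified patterns
- The ORM silently drops null values on bulk insert (seen twice).
- User prefers 2-space indentation in all TypeScript files.
- The payments API always returns pagination headers; never re-fetch page 1.
```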

&lt;p&gt;&lt;code&gt;next-session-prompt.md&lt;/code&gt; has a clean thread of where each project stands. You switch between three projects? Each one picks up exactly where it left off.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pre-compact.sh&lt;/code&gt; hook has saved your context at least twice — you didn't even notice because it just worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons from 1000 Sessions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The agent won't use memory unless you tell it to.&lt;/strong&gt; Claude Code has an auto-memory directory, but in my experience it stays empty. The system-level mechanism exists, but without explicit instructions in &lt;code&gt;CLAUDE.md&lt;/code&gt;, the agent rarely writes to it. That's why the starter kit includes both the files &lt;em&gt;and&lt;/em&gt; the instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-project safety matters more than you think.&lt;/strong&gt; Two Claude Code windows editing the same file = silent data loss. The PROJECT tags solve this — each window only edits its own section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-compact hooks are essential, not optional.&lt;/strong&gt; When Claude's conversation gets too long, it compresses the history. If your context wasn't saved before compression, it's gone.&lt;/p&gt;
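&lt;p&gt;Hooks like these get wired up through Claude Code's project settings. A rough sketch of such a registration (the exact schema may differ between releases, and the script path is the one used in this kit):&lt;/p&gt;

```json
{
  "hooks": {
    "PreCompact": [
      {
        "hooks": [
          { "type": "command", "command": ".claude/hooks/pre-compact.sh" }
        ]
      }
    ]
  }
}
```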

&lt;p&gt;&lt;strong&gt;Skills should live globally, not per-project.&lt;/strong&gt; I tried per-project skills. Then I had 8 copies of the same Gemini skill, each slightly out of date. Global install works much better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The kit is MIT-licensed and contributions are welcome. Areas that need work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More starter templates&lt;/strong&gt; — framework-specific (Next.js, Python, Rust)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill discovery&lt;/strong&gt; — better triggering descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict resolution&lt;/strong&gt; — true parallel writes still need locking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building with Claude Code daily, give the starter kit a try. The setup takes 30 seconds and the payoff compounds with every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/awrshift/claude-starter-kit" rel="noopener noreferrer"&gt;awrshift/claude-starter-kit&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Serhii Kravchenko — based on 1000+ sessions of iterative refinement with Claude Code.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
