DEV Community: Chudi Nnorukam

Entity Optimization for Brands in AI Search

Chudi Nnorukam — Wed, 22 Apr 2026 22:13:03 +0000

Originally published at chudi.dev

AI search engines do not rank pages. They score entities, then quote the entity that best matches the query. For a sub-DR-20 brand (a site with a Domain Rating under 20), this is the good news. You cannot outspend enterprise SEO teams on backlinks. You can out-engineer them on entity coherence.

This post is the engineering playbook for that work. It covers the three schemas that anchor the graph, the sameAs surface where most brands leak trust, the knowsAbout field that tells engines what you are an authority on, and the measurement loop that tells you when it is working. The reference model throughout is chudi.dev (the Person brand) and citability.dev (the product brand): a deliberate two-node graph that synergizes without cannibalizing.

§1 — Why AI citations resolve to entities, not URLs

Classical SEO optimized for pages. Answer engines optimize for the entity behind the page. If two sites say the same thing and one site has a coherent Person + Organization + sameAs graph, the engine quotes the coherent site even when the other page ranks higher in traditional SERPs. Cite the entity, not the URL. That is the mental model shift this cluster pivots on.

§2 — The three anchor schemas

Every resilient entity graph has three anchor schemas. Person. Organization. SoftwareApplication. They cross-reference through creator, publisher, and about fields. Miss one anchor and the graph reads as a disconnected individual, a disconnected company, or a disconnected tool. Engines pick the version of your story they find easiest to summarize, which may not be the version you want quoted.
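A minimal sketch of how the three anchors might cross-reference in JSON-LD, built here in Python. The `@id` values, the app name, and the `founder` link are illustrative assumptions, not the markup chudi.dev actually ships; only `creator` and `publisher` are shown, and `about` could be wired the same way.

```python
import json

# Hypothetical node identifiers; any stable URL works as an @id.
PERSON_ID = "https://chudi.dev/#person"
ORG_ID = "https://citability.dev/#organization"
APP_ID = "https://citability.dev/#app"


def build_entity_graph() -> dict:
    """Return a JSON-LD @graph holding the three anchor nodes."""
    person = {
        "@type": "Person",
        "@id": PERSON_ID,
        "name": "Chudi Nnorukam",
        "url": "https://chudi.dev",
    }
    organization = {
        "@type": "Organization",
        "@id": ORG_ID,
        "name": "citability.dev",
        "url": "https://citability.dev",
        "founder": {"@id": PERSON_ID},  # assumption: illustrative link only
    }
    app = {
        "@type": "SoftwareApplication",
        "@id": APP_ID,
        "name": "citability scorer",  # hypothetical product name
        "creator": {"@id": PERSON_ID},
        "publisher": {"@id": ORG_ID},
    }
    return {"@context": "https://schema.org", "@graph": [person, organization, app]}


def dangling_references(graph: dict) -> list[str]:
    """List @id references that point at no node in the graph."""
    nodes = graph["@graph"]
    known = {node["@id"] for node in nodes}
    missing = []
    for node in nodes:
        for value in node.values():
            if isinstance(value, dict) and "@id" in value and value["@id"] not in known:
                missing.append(value["@id"])
    return missing


if __name__ == "__main__":
    print(json.dumps(build_entity_graph(), indent=2))
```

The `dangling_references` check is the useful part: a missing anchor shows up as a reference that resolves to nothing.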

§3 — sameAs coherence across platforms

The sameAs array is the single most valuable field in the Person schema for sub-DR-20 brands. It is also the field most brands let drift. Six platforms. Six subtly different job titles. Six subtly different descriptions. Engines treat this as ambiguity and choose a canonical version, usually the one on the platform with the highest authority, not the one you care about.
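One way to catch that drift is to compare the same field across every profile listed in sameAs. The sketch below assumes you have already collected each platform's title and description by hand or by scraping; the platform names and field names are hypothetical.

```python
from collections import defaultdict


def find_drift(profiles: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Report fields whose value differs between platforms.

    Values are compared after collapsing whitespace and ignoring case,
    so only real wording differences count as drift.
    """
    drift = {}
    fields = {field for profile in profiles.values() for field in profile}
    for field in sorted(fields):
        by_value = defaultdict(list)
        for platform, profile in profiles.items():
            normalized = " ".join(profile.get(field, "").split()).casefold()
            by_value[normalized].append(platform)
        if len(by_value) > 1:
            drift[field] = dict(by_value)
    return drift


if __name__ == "__main__":
    # Hypothetical profile data for illustration.
    example = {
        "linkedin": {"job_title": "AI Visibility Engineer"},
        "github": {"job_title": "AI Visibility Engineer"},
        "x": {"job_title": "Founder"},
    }
    print(find_drift(example))
```

An empty result means every platform tells the same story for every field.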

§4 — The knowsAbout surface nobody uses

knowsAbout is the declarative authority claim. Use it to stake your territory. For chudi.dev, the stake is AI Visibility Engineering, Generative Engine Optimization, Entity Graph Architecture, and Sub-DR-20 SEO. Four claims. Four phrases an engine can match against a query.

§5 — The citation flow

When a query arrives, an AI engine runs a pipeline. Retrieve candidate entities. Score coherence. Gate. Cite. Optimizing for citation means shortening the distance between query and gate.

§6 — Measurement: how you know it is working

The answer engines that matter (Perplexity, ChatGPT with search, Google AI Overviews) do not expose rank. They expose citations. Measuring them takes a different instrument than Google Search Console. Track citation count per engine, quote length per citation, and coherence-check failures across your sameAs graph.
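A small sketch of the first two metrics, assuming citations are logged by hand or by some scraper that is out of scope here. The `Citation` record and its fields are hypothetical; coherence-check failures would need their own log.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Citation:
    engine: str  # e.g. "perplexity"
    query: str   # the prompt that produced the citation
    quote: str   # the text the engine quoted


def summarize(citations: list[Citation]) -> dict[str, dict[str, float]]:
    """Return citation count and average quote length (in words) per engine."""
    words_by_engine = defaultdict(list)
    for citation in citations:
        words_by_engine[citation.engine].append(len(citation.quote.split()))
    return {
        engine: {"citations": len(lengths), "avg_quote_words": mean(lengths)}
        for engine, lengths in words_by_engine.items()
    }
```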

Bridge — from thinking to measurement

This post describes the thinking. citability.dev is the instrumentation. Run the citability scorer on any chudi.dev post to see what the engine actually extracts, which entity it resolves to, and where the coherence gates are failing.

ADHD Remote Work for Developers: Build for Context Decay, Not Perfect Focus

Chudi Nnorukam — Fri, 10 Apr 2026 23:55:49 +0000

Originally published at chudi.dev

I worked from home for eight months before I figured out I was failing at remote work specifically because of ADHD, not despite doing everything "right."

I had a desk. A monitor. A to-do list. I woke up at the same time every day. And still, whole afternoons would vanish. I'd look up at 5pm, realize I'd been deep in a rabbit hole since 11am, and have nothing to show for it that anyone would call work.

The problem wasn't focus. ADHD gets described as a focus problem, but that's wrong. My focus was fine. It was just pointed at the wrong things for hours at a time, with no external force to redirect it. That's what remote work does to ADHD brains: it removes every environmental cue that normally handles redirection for you.

Here's what I replaced those cues with.

Why remote work hits ADHD harder than most people realize

An office is full of behavioral scaffolding you don't notice until it's gone.

The commute is a transition ritual. It physically separates "home brain" from "work brain." Without it, you start working and your nervous system never quite shifts modes.

Other people visibly working is a constant time cue. You glance up and see colleagues at their desks, it's 2pm, you haven't eaten. That passive awareness regulates your own sense of time. Without it, time blindness kicks in fully.

Scheduled meetings are anchors. Even if you hate meetings, they force context switching at predictable intervals. They're external interrupts the ADHD brain doesn't have to generate internally.

Remote work strips all three. What you're left with is a quiet room, unlimited flexibility, and a brain that cannot generate external structure from nothing.

Flexibility, for ADHD, is not a feature. It's a trap.

Environment design: make your space do the transition work

The most reliable fix I found is physical environment separation. Not because it's psychologically important in some abstract way, but because ADHD brains form strong context associations. If you sleep, relax, and work in the same chair, your brain genuinely cannot tell which mode to be in.

If you have a separate room, use it only for work. Close the door when you're done. The physical act of closing the door is the end-of-day ritual. Open it again when starting. That's the whole commute.

If you don't have a separate room (I didn't for two years), use environmental cues instead:

A specific chair at a specific surface. Not the couch. Not the bed. One chair that means work. Sit in it, you're working. Leave it, you're not.

Distinct lighting. I use a desk lamp on a smart plug scheduled to turn on at 9am and off at 6pm. When it turns on, that's the ambient "it's work time" signal. This sounds absurd. It works.

A start ritual that's identical every day. Mine is: make coffee, carry it to the desk, open a specific playlist. Three steps, always the same order. The ritual triggers the context switch. Willpower doesn't have to.

The goal isn't a beautiful home office. It's tricking your nervous system into associating a location with a mode.

Time anchors: replace meetings with artificial deadlines

Time blindness is the ADHD symptom that remote work makes worst. In an office, you feel time passing through social cues. At home, four hours can feel like forty minutes.

The fix is time anchors, and the most effective one I've found is a scheduled video call. It doesn't matter what the call is about. It just has to happen at a predictable time, because your brain will orient itself around it. "I have a call at 2pm" creates a before-the-call and an after-the-call. That's two time blocks with structure, instead of one infinite formless day.

If you don't have regular calls, manufacture them:

Daily standup with yourself. Five minutes on video, speaking out loud, covering what you did yesterday and what you're doing today. The video part matters. Talking to a camera activates social awareness in a way typing doesn't.

Focusmate sessions. Two people, 50-minute video session, each says what they're working on, then works silently, then checks back in. It's structured, it ends, and the commitment to another person is enough accountability to keep most ADHD brains on task.

90-minute alarms. Not as a task timer. As a "what are you doing right now" interrupt. When it goes off, write one sentence in a running doc: what you're doing, whether it's the right thing. That's it. The awareness loop is enough to reduce hyperfocus drift.

Async communication: close every loop explicitly

Open communication loops are ADHD kryptonite at home.

In an office, you send a Slack message and you can see the person at their desk. You know they're alive. You have some sense of whether they saw it. At home, you send something and then... nothing. That uncertainty sits in your brain as an open tab, consuming attention.

The fix is aggressive loop-closing, on both ends.

Send explicit "done" messages. When you finish a task that someone asked for, message them directly: "Done, PR is up." Don't assume they'll see the PR notification. Close the loop yourself. This sounds like extra work but it's actually cognitive offloading — you're moving the "did they see it" question out of your head and into their inbox.

Batch your Slack/email. Check twice a day: 9:30am and 3:30pm. Outside those windows, quit the app. I mean actually quit it, not just minimize it. The notification bubble sitting in your dock is a distraction even when you're not looking at it.

Use Loom instead of "let's hop on a call." When you need to explain something complex, record a 3-minute video. It's faster to make than scheduling a call, the other person can watch it at their pace, and you don't have to hold the context of what you were going to say while waiting for a meeting slot to open up.

Set explicit Slack statuses. "Deep work until 2pm" tells your team not to expect synchronous response. It also reminds you what you're supposed to be doing when you look at your own status and see it staring back.

Body doubling: the remote version

Body doubling is working in the presence of another person. It activates the social engagement system, which provides the kind of external accountability ADHD brains need to sustain attention on boring tasks. It sounds like it shouldn't work. It works.

The office version is automatic. The remote version requires effort to set up, but it's worth it.

Focusmate (focusmate.com) is purpose-built for this. You book 50-minute sessions with a random partner, you both say what you're working on, you work silently, you check back in at the end. It's free for a few sessions per week, paid for unlimited. I've used it to get through tax returns, documentation, and every other task that normally gets avoided indefinitely.

Discord co-working servers. There are ADHD-specific ones with always-on voice channels where people work silently together. You join, unmute when you want company, mute when you need quiet. The ambient presence of other people is enough.

A recurring call you never close. I have a weekly video call with a friend who also works remotely. We don't talk during it. We just exist on the same screen, working. When one of us needs to say something, we say it. This is body doubling. It costs nothing and it's sustainable.

Managing your manager remotely with ADHD

The part nobody talks about: ADHD makes remote work harder because managers can't see you working. In an office, your presence is visible evidence of effort. At home, you have to manufacture that evidence.

This isn't about performing productivity. It's about removing the anxiety of wondering whether your manager thinks you've disappeared.

Daily end-of-day summary, one sentence. "Finished the auth PR, reviewed two tickets, starting the cache refactor tomorrow." Send it every day, at the same time, without being asked. It closes the loop for your manager and it forces you to acknowledge what you actually did, which matters for your own ADHD sense of accomplishment.

Overcommunicate blockers. When you're stuck, say so immediately. ADHD means stuck can turn into three hours of silent spiraling without warning. Saying "I'm blocked on X, any context?" gets you unblocked faster and signals that you're engaged, not absent.

Use async video for updates that would otherwise be a meeting. A 2-minute Loom is easier for everyone than a 30-minute call. It respects your manager's time, it documents the update, and it proves you exist and are thinking.

What I stopped doing

I stopped trying to simulate the office at home. I don't have scheduled "office hours" where I sit at my desk whether or not I have anything to do. I don't use time-blocking as a rigid constraint because rigid constraints with ADHD become guilt traps when they break, which they always do.

Instead, I work with the parts of ADHD that actually help remote work. Hyperfocus is an asset when pointed at the right thing, so I protect large uninterrupted blocks in the morning. The lack of commute means I can start working when my brain is actually ready, not when a train schedule says I should be at a desk.

Remote work with ADHD doesn't require suppressing the ADHD. It requires replacing the scaffolding the office provided with intentional, homemade versions of the same cues.

The cues work. The willpower approach doesn't. That's the whole thing.

If you want to compare notes on what's working, the contact is at the bottom of the page. Remote ADHD work is still being figured out by most of us in real time.

Self-Improving RAG: Teaching Claude Code to Learn From Errors

Chudi Nnorukam — Fri, 10 Apr 2026 23:55:21 +0000

Originally published at chudi.dev

I was debugging the same authentication error for the third time this month.

Same error. Same root cause. Same fix.

Claude Code had solved this exact problem two weeks ago—but it didn't remember. Each session starts fresh. No memory of what worked, what failed, or what patterns emerged.

That's a massive waste of debugging time.

So I built a system to fix it.

The Problem With Stateless AI

Claude Code is powerful, but it has a fundamental limitation: every session starts from zero.

This means:

Same mistakes repeated across sessions
No accumulation of project-specific knowledge
Debugging loops that should take minutes take hours
Learnings trapped in conversation history, never extracted

The irony? Claude Code can solve complex problems. It just can't remember that it already solved them. This is where building a self-improving RAG system becomes transformative.

Introducing the Self-Improving RAG System

I built a configuration that makes Claude Code learn from every debugging session.

The core idea: automatic capture, structured storage, intelligent retrieval.

When something breaks, the system captures it. When something works, the system remembers it. When a session ends, the system reflects on what happened.

No manual logging. No /learn commands for every insight. The system watches, learns, and improves. This is built on the principles of Retrieval-Augmented Generation (RAG)—using external knowledge to enhance AI responses—combined with Claude Code's development capabilities.

Architecture: Three Memory Layers

The system uses three complementary memory approaches:

┌─────────────────────────────────────────────────────────────────┐
│                     Knowledge Layer                              │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ ChromaDB     │  │ Graph Memory │  │ CLAUDE.md    │          │
│  │ (Vectors)    │  │ (Relations)  │  │ (File)       │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                  │
│  Collections:          Relationships:     Sections:             │
│  • error_patterns      • error→file       • Known Pitfalls      │
│  • successful_patterns • error→fix        • Successful Patterns │
│  • project_learnings   • fix→file         • Error History       │
│  • meta_learnings      • decision→outcome                       │
└─────────────────────────────────────────────────────────────────┘

Layer 1: ChromaDB (Semantic Search)

Vector embeddings enable semantic search across all captured knowledge.

Collections:

error_patterns: Build failures, type errors, runtime exceptions
successful_patterns: What worked—deployment patterns, architecture decisions
project_learnings: Project-specific insights
meta_learnings: Process improvements from self-reflection

When I search "authentication errors," I get relevant results even if the exact phrase wasn't used.
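A sketch of what that lookup could look like. The helper name `search_errors` is hypothetical; it takes any object with Chroma's `Collection.query` method, so with chromadb installed you would pass the result of `chromadb.PersistentClient(path=...).get_or_create_collection("error_patterns")`.

```python
def search_errors(collection, query: str, n_results: int = 3) -> list[dict]:
    """Return the stored errors closest in meaning to a free-text query."""
    result = collection.query(query_texts=[query], n_results=n_results)
    # Chroma returns one inner list per query text; we sent exactly one.
    ids = result["ids"][0]
    documents = result["documents"][0]
    distances = result["distances"][0]
    return [
        {"id": doc_id, "text": text, "distance": distance}
        for doc_id, text, distance in zip(ids, documents, distances)
    ]
```

Smaller distances mean closer matches, so the first hit is the best candidate for "have I seen this before?"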

Layer 2: Graph Memory (Relationships)

ChromaDB stores flat documents. But knowledge has structure.

Graph memory tracks relationships:

Error ──occurred_in──→ File
  │
  └──fixed_by──→ Fix ──applied_to──→ File

Decision ──led_to──→ Outcome
                        │
Learning ←──learned_from─┘

This enables queries like:

"What errors have occurred in auth.ts?"
"What fixes have been applied to the API module?"
"What decisions led to successful deployments?"

Relationships reveal patterns that flat search misses.
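The post does not say which graph store it uses, so here is a minimal sketch that models the relationships as an edge table in SQLite. The table layout and function names are assumptions for illustration.

```python
import sqlite3


def open_graph(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) an edge table of typed source -> relation -> target rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges ("
        " src_type TEXT, src TEXT, relation TEXT, dst_type TEXT, dst TEXT,"
        " UNIQUE (src_type, src, relation, dst_type, dst))"
    )
    return conn


def add_edge(conn, src_type, src, relation, dst_type, dst) -> None:
    """Record one relationship; duplicates are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO edges VALUES (?, ?, ?, ?, ?)",
        (src_type, src, relation, dst_type, dst),
    )
    conn.commit()


def errors_in_file(conn, filename: str) -> list[str]:
    """Answer: what errors have occurred in this file?"""
    rows = conn.execute(
        "SELECT src FROM edges"
        " WHERE src_type = 'error' AND relation = 'occurred_in' AND dst = ?"
        " ORDER BY src",
        (filename,),
    )
    return [row[0] for row in rows]


def fixes_for_error(conn, error: str) -> list[str]:
    """Answer: which fixes resolved this error?"""
    rows = conn.execute(
        "SELECT dst FROM edges"
        " WHERE src_type = 'error' AND src = ? AND relation = 'fixed_by'"
        " ORDER BY dst",
        (error,),
    )
    return [row[0] for row in rows]
```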

Layer 3: CLAUDE.md (Project Memory)

Each project maintains a CLAUDE.md file—a living document that Claude reads at session start.

Sections:

Known Pitfalls: Project-specific gotchas (auto-populated by hooks)
Successful Patterns: Code patterns that have worked
Error History: Recent errors with resolutions

This provides immediate context without database queries.
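A sketch of how a hook might append to one of those sections without touching the rest of the file. The function name and the bullet format are assumptions; the post does not show its own updater.

```python
def add_pitfall(markdown: str, pitfall: str, heading: str = "## Known Pitfalls") -> str:
    """Append a bullet under `heading`, creating the section if it is missing."""
    bullet = f"- {pitfall}"
    lines = markdown.splitlines()
    if bullet in lines:
        return markdown  # already recorded; keep the file unchanged

    start = next((i for i, line in enumerate(lines) if line.startswith(heading)), None)
    if start is None:
        # No such section yet: add it at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [heading, bullet]
        return "\n".join(lines) + "\n"

    # The section ends at the next level-2 heading or at end of file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Insert after the last non-blank line of the section.
    insert_at = end
    while insert_at > start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1
    lines.insert(insert_at, bullet)
    return "\n".join(lines) + "\n"
```

Because the match uses `startswith`, a heading such as "## Known Pitfalls (auto-updated)" is found as well.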

Automatic Learning Capture

The magic is in the hooks—scripts that intercept Claude Code events.

Hook: Capture Failures

When a build or test fails:

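# capture_failure.py (PostToolUse hook)
from datetime import datetime, timezone

# extract_error, store_to_chromadb, and update_graph are assumed to be
# project helpers defined elsewhere; they are not shown in this post.
def capture_failure(tool_result):
    # Only failed commands (non-zero exit code) are worth recording.
    if tool_result.exit_code != 0:
        error = extract_error(tool_result.output)
        store_to_chromadb({
            "type": "error_pattern",
            "error": error,
            "file": tool_result.file,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        # Link the error to the file it occurred in.
        update_graph("error", error, "occurred_in", tool_result.file)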

No manual intervention. Failures get captured automatically.

Hook: Track File Edits

When files are modified:

# log_edit.py (PostToolUse hook)
def log_edit(tool_result):
    if tool_result.tool == "Edit":
        update_graph("fix", tool_result.diff, "applied_to", tool_result.file)

This builds the error→fix→file relationship over time.

Hook: Session Summary

When a session ends:

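# session_summary.py (Stop hook)
# extract_learnings, update_claude_md, and store_to_chromadb are assumed to be
# project helpers defined elsewhere; they are not shown in this post.
def session_summary(session_history):
    # Distill the finished session, then persist it twice: in CLAUDE.md for
    # the next session start, and in ChromaDB for semantic search.
    learnings = extract_learnings(session_history)
    update_claude_md(learnings)
    store_to_chromadb(learnings)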

The system extracts what was learned and persists it.

Self-Reflection: Meta-Learning

Beyond capturing individual learnings, the system reflects on patterns.

At session end, a self-reflection agent analyzes:

What approaches were effective
What inefficiencies occurred
What patterns emerged

These meta-learnings go into a separate collection—insights about the development process itself, not just specific bugs.

Example meta-learning:

"When debugging TypeScript type errors, checking the imported types first resolves 70% of issues faster than tracing through the codebase."

This is knowledge about how to debug, not just what broke.

Memory Decay: Keeping Knowledge Fresh

Old knowledge becomes stale. A fix that worked six months ago might not apply to the current architecture.

Memory decay solves this:

Half-life: 90 days (configurable)
Minimum weight: 0.1 (never fully forgotten)
Access boost: Recently used memories get +20% weight

The result: Claude prioritizes recent, actively-used knowledge while maintaining a long-term memory of rare but important patterns. This is similar to the token optimization strategies I've outlined for reducing AI token usage, where selective information display keeps context efficient.
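The three parameters can be read as one formula. The sketch below assumes exponential decay, a multiplicative boost, and a cap at 1.0; the post gives the numbers but not how they combine, so treat the combination as a guess.

```python
def memory_weight(
    age_days: float,
    half_life_days: float = 90.0,
    min_weight: float = 0.1,
    recently_accessed: bool = False,
    access_boost: float = 0.20,
) -> float:
    """Weight of a memory after `age_days`, halving every `half_life_days`."""
    weight = 0.5 ** (age_days / half_life_days)
    if recently_accessed:
        weight *= 1.0 + access_boost  # assumed: +20% applied multiplicatively
    # Never exceed a fresh memory, never drop below the floor.
    return max(min_weight, min(1.0, weight))
```

With these defaults a 90-day-old memory weighs half as much as a fresh one, and anything older than roughly 300 days sits at the 0.1 floor.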

Custom Commands

The system adds slash commands for manual interaction:

Command	What It Does
`/learn`	Manually capture a learning from the current session
`/search-knowledge "query"`	Search all memory layers
`/review-plan`	Validate a plan against past learnings
`/reflect`	Trigger self-reflection analysis
`/memory-stats`	Show knowledge base statistics
`/memory-maintain`	Run decay, merge duplicates, archive old memories

Most learning happens automatically. These commands are for when you want manual control.

Effort-Based Routing

Not every task needs maximum AI capability. The system classifies prompts:

Level	Example	Token Impact
`low`	"What's the syntax for..."	Fastest, cheapest
`medium`	"Implement this feature"	Balanced
`high`	"Architect the auth system"	Maximum capability

Simple questions get fast answers. Complex problems get deep thinking.

Real Results

After two months of use:

Before (vanilla Claude Code):

Same auth bug: 45 minutes to debug (third time)
Build failures: Manual pattern recognition
Session knowledge: Lost after conversation ends

After (self-improving RAG):

Same auth bug: /search-knowledge "auth middleware" → fix in 2 minutes
Build failures: Automatic capture, searchable history
Session knowledge: Persisted, searchable, improving

The system has captured 150+ error patterns, 45 successful patterns, and 80 meta-learnings across my projects. For more on Claude Code workflows and best practices, see my comprehensive Claude Code guide.

Getting Started

The system is available as a configuration you can install:

cd ~/Projects/active/claude-rag-config
./setup.sh

# Then in any project:
claude

Setup installs:

Hooks for automatic capture
Custom commands for manual control
ChromaDB for vector storage
Graph memory database
CLAUDE.md template

Requirements: Python 3.9+, Node 18+, ChromaDB (pip install chromadb)

Getting Started: The Minimum Viable RAG Setup

You don't need the full system on day one. Here's the smallest version that actually makes a difference.

Step 1: Install ChromaDB

pip install chromadb

That's your vector store. One command.

Step 2: Create a capture hook

Create a file at ~/.claude/hooks/post_tool_use.py:

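import hashlib
import json
import os
import sys
from datetime import datetime

import chromadb

# Field names ("tool", "exit_code", "output") follow this post; check them
# against the hook payload your Claude Code version actually sends.
data = json.loads(sys.stdin.read())

if data.get("tool") == "Bash" and data.get("exit_code", 0) != 0:
    # Python does not expand "~" in paths on its own, so do it explicitly.
    memory_path = os.path.expanduser("~/.claude/memory")
    client = chromadb.PersistentClient(path=memory_path)
    collection = client.get_or_create_collection("error_patterns")

    error_text = data.get("output", "")[:500]
    if error_text:
        # Hash the text so a repeated error updates one record instead of duplicating.
        doc_id = hashlib.md5(error_text.encode()).hexdigest()
        collection.upsert(
            documents=[error_text],
            ids=[doc_id],
            metadatas=[{"timestamp": datetime.now().isoformat()}],
        )

print(json.dumps({"continue": True}))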

This captures every failed Bash command into ChromaDB. No manual intervention.

Step 3: Add a search command

Add this to your CLAUDE.md:

## /search-errors command
When user types /search-errors "query":
1. Query ChromaDB error_patterns collection
2. Return top 3 similar past errors and their context
3. Suggest fixes based on patterns

Step 4: Add a project-specific CLAUDE.md section

## Known Pitfalls (auto-updated)
<!-- This section gets updated by the session summary hook -->

That's the minimum viable setup. Four steps, maybe 20 minutes. You won't have graph memory or self-reflection yet, but you'll have semantic search over your past errors, which is where most of the day-to-day value comes from.

The auth bug I mentioned at the top of this post? The minimum viable version would have caught it. The error was in the database. The fix was two queries away.

Build the full system when the minimum version proves itself. For me, that took about 3 weeks.

The minimum version will change how you think about debugging. Instead of starting from scratch each session, you'll start with a search. That shift alone is worth the 20-minute setup cost.

What's Next

The system keeps improving. Planned additions:

Cross-project learning: Patterns that work in one project suggested in others
Confidence scoring: How reliable is this learning based on how often it's worked?
Team memory: Shared knowledge base across collaborators

The Bigger Picture

This isn't just about remembering bugs.

It's about accumulating developer judgment in a searchable, queryable format.

Every debugging session teaches something. Without capture, those lessons evaporate. With this system, they compound.

Claude Code doesn't just help you code. It becomes a repository of everything you've learned about your codebase—and it gets smarter every session.

Questions about implementing this in your workflow? Reach out on LinkedIn.

Sources

Claude Code Best Practices (Anthropic)
Claude Code Hooks Documentation (Anthropic)
ChromaDB Documentation (Chroma)

Bug Bounty Automation: Building Security Workflows That Scale

Chudi Nnorukam — Fri, 10 Apr 2026 23:54:50 +0000

Originally published at chudi.dev

My first automated bug bounty scan found 47 "critical" vulnerabilities.

I submitted 12 reports. Every single one was a false positive.

The program I targeted now knows my name. Not in a good way.

That specific embarrassment is what made me rebuild everything from scratch. Not a faster scanner. Not a better scanner. A fundamentally different approach to what automation should and shouldn't do in security research.

This guide is the result: a complete system for bug bounty automation that actually works in production.

What Bug Bounty Automation Actually Is (and Isn't)

Bug bounty automation is not a script that finds vulnerabilities for you.

That framing leads directly to 47 false positive submissions and a wrecked reputation.

What it actually is: a system that handles the mechanical parts of security research — reconnaissance, asset discovery, initial scanning — while keeping humans in control of the decision that matters most: what to submit.

The best automation makes you a more effective researcher. It doesn't replace your judgment. It amplifies it.

What automation handles well:

Subdomain enumeration across certificate transparency logs
Technology fingerprinting at scale
Running known payload patterns against hundreds of endpoints simultaneously
Tracking which findings have been validated vs. just detected
Generating properly formatted reports for each platform's requirements

What automation handles poorly:

Novel vulnerability classes that don't match existing patterns
Context-aware exploitation (is this XSS actually exploitable in this specific app context?)
Deciding whether a finding is worth a researcher's reputation
Anything that requires reading the room on a specific target

Understanding this division is more important than any technical decision you'll make.

The Core Architecture: 4 Agents, One Orchestrator

After rebuilding the system twice, the architecture that works is a 4-agent pipeline coordinated by a central orchestrator.

Orchestrator (Claude Opus)
├── Recon Agents (parallel)
├── Testing Agents (max 4 concurrent)
├── Validation Agent (single, evidence-gated)
└── Reporter Agent (platform-specific formatters)

The orchestrator is a project manager, not a worker. It distributes tasks, manages rate limit budgets, detects agent failures, and persists session state between runs. It never touches an endpoint directly.

Recon Agents

Recon runs in parallel across multiple discovery methods:

Subdomain enumeration via certificate transparency (crt.sh, Censys)
Technology fingerprinting with httpx to identify frameworks, servers, CDNs
JavaScript analysis for hidden endpoints, API keys in source, internal route paths
GraphQL introspection where applicable

All discovered assets feed into a shared SQLite database. Recon agents never block each other — if subdomain enum hits a rate limit, JavaScript analysis keeps running.
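A sketch of that non-blocking behavior using `asyncio.gather` with `return_exceptions=True`. The two recon functions are stand-ins that make no network calls, and the `assets` table layout is an assumption.

```python
import asyncio
import sqlite3


class RateLimited(Exception):
    """Raised by a recon method when the upstream source throttles it."""


async def enumerate_subdomains(target: str) -> list[str]:
    # Stand-in for a certificate transparency lookup that gets throttled.
    raise RateLimited("hypothetical: source throttled the request")


async def analyze_javascript(target: str) -> list[str]:
    await asyncio.sleep(0)  # stand-in for real work
    return [f"https://{target}/api/internal"]


def open_assets_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS assets"
        " (target TEXT, asset TEXT, source TEXT, UNIQUE (target, asset))"
    )
    return db


async def run_recon(target: str, methods, db: sqlite3.Connection) -> dict[str, str]:
    """Run all recon methods in parallel; a failing method never stops the others."""
    results = await asyncio.gather(
        *(method(target) for method in methods), return_exceptions=True
    )
    failures = {}
    for method, result in zip(methods, results):
        if isinstance(result, Exception):
            failures[method.__name__] = str(result)
            continue
        db.executemany(
            "INSERT OR IGNORE INTO assets (target, asset, source) VALUES (?, ?, ?)",
            [(target, asset, method.__name__) for asset in result],
        )
    db.commit()
    return failures
```

The returned `failures` map is what an orchestrator could use to decide which method to retry later.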

Testing Agents

Testing agents take the recon output and probe for vulnerabilities. I cap these at 4 concurrent to avoid triggering WAFs or rate limits.

What they test:

IDOR: multi-account replay of authenticated requests
XSS: payload injection with response diff analysis
SQL injection: error-based and time-based patterns
SSRF: metadata service probing, internal network access
Authentication issues: token fixation, session handling edge cases

Each testing agent handles one vulnerability class. Failure is isolated — if the IDOR agent crashes, XSS testing continues unaffected.
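The concurrency cap and the failure isolation can both be sketched with an `asyncio.Semaphore`. The agent callables here are hypothetical placeholders; a real agent would send requests.

```python
import asyncio

MAX_CONCURRENT = 4  # cap from the text, to stay under WAF and rate limits


async def run_testing_agents(agents: dict, endpoints: list[str]) -> dict:
    """Run one agent per vulnerability class, at most MAX_CONCURRENT at a time.

    Returns a map of agent name to its findings, or to the exception it raised.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(name, agent):
        async with semaphore:
            try:
                return name, await agent(endpoints)
            except Exception as exc:
                # Isolate the failure: record it and let the others continue.
                return name, exc

    pairs = await asyncio.gather(*(guarded(n, a) for n, a in agents.items()))
    return dict(pairs)
```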

Validation Agent: The Most Important Part

Here's the thing most bug bounty automation gets wrong: detection is not exploitation.

My payload appearing in a response means nothing. It might be in an error log that's never rendered, in an HTML attribute that's properly escaped, on a WAF block page, or in a JSON response that's never interpreted as HTML.

The Validation Agent's only job is to disprove findings.

The evidence gate process:

Every finding starts with a confidence score between 0.0 and 1.0, set by the initial detection (around 0.3 for most). Confidence determines routing, not just advancement:

Confidence	Action
0.85+	Immediate human review queue
0.70–0.84	Same-day batch review
0.40–0.69	Weekly review
Below 0.40	Discarded, pattern logged
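The routing table translates directly into a function. The queue names are hypothetical labels; the thresholds are the ones in the table, with `>=` comparisons so that values such as 0.845 fall into a defined band.

```python
def route_finding(confidence: float) -> str:
    """Map a confidence score to the review queue from the table above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within [0.0, 1.0], got {confidence}")
    if confidence >= 0.85:
        return "immediate_review"
    if confidence >= 0.70:
        return "same_day_batch"
    if confidence >= 0.40:
        return "weekly_review"
    return "discard_and_log_pattern"
```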

To reach 0.85+:

Baseline capture: Normal request with innocuous input. Record response headers, body length, content type.
PoC execution: Same request with malicious payload in a sandboxed environment.
Response diff analysis: Not "does the response contain my payload?" but "does the response differ from baseline in an exploitable way?"
False positive signature matching: Known-harmless patterns get auto-dismissed.

If PoC succeeds and diff analysis confirms exploitability: confidence rises to 0.85+. Queued for human review.

If PoC fails: confidence drops. Finding goes to weekly batch review, not discarded.

This is adversarial validation. The agent is trying to kill findings. Findings that survive are credible.

Since implementing this: 0 false positives submitted across 3 months.

The finding lifecycle is a state machine. Findings move through defined states with explicit transitions:

States: new → validating → reviewed → submitted / dismissed

new → validating (automatic)
validating → validating (confidence adjustment, up or down)
validating → reviewed (0.70+ confidence)
reviewed → submitted (human approval)
reviewed → dismissed (human rejection)

Confidence isn't binary. A finding can gain or lose credibility based on evidence at every step.
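A sketch of the lifecycle as an explicit transition table. The class and function names are assumptions; the states, the 0.70 threshold, and the human-only exits from "reviewed" come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    NEW = "new"
    VALIDATING = "validating"
    REVIEWED = "reviewed"
    SUBMITTED = "submitted"
    DISMISSED = "dismissed"


# Transitions exactly as listed; anything else is rejected.
ALLOWED = {
    State.NEW: {State.VALIDATING},
    State.VALIDATING: {State.VALIDATING, State.REVIEWED},
    State.REVIEWED: {State.SUBMITTED, State.DISMISSED},
    State.SUBMITTED: set(),
    State.DISMISSED: set(),
}
REVIEW_THRESHOLD = 0.70


@dataclass
class Finding:
    title: str
    confidence: float = 0.3  # typical starting score per the text
    state: State = State.NEW


def transition(finding: Finding, target: State, *, by_human: bool = False) -> Finding:
    """Move a finding to `target`, enforcing the lifecycle rules."""
    if target not in ALLOWED[finding.state]:
        raise ValueError(f"illegal transition: {finding.state.value} -> {target.value}")
    if target is State.REVIEWED and finding.confidence < REVIEW_THRESHOLD:
        raise ValueError("confidence below review threshold")
    if finding.state is State.REVIEWED and not by_human:
        raise PermissionError("leaving 'reviewed' requires a human decision")
    finding.state = target
    return finding
```

Making the human decision a required argument means no code path can submit a report by accident.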

Reporter Agent

Once a finding clears human review and gets approved, the Reporter Agent handles formatting. Every platform has different submission requirements. I built a unified findings model plus platform-specific formatters — write the finding once, output to HackerOne, Intigriti, or Bugcrowd format automatically.

The Learning Layer: SQLite RAG

The piece I didn't plan but won't remove.

Every time an agent hits a rate limit, gets banned, or has a finding dismissed, it logs that to a SQLite database with semantic embeddings. Before running against a new target, the orchestrator queries this database — "have we seen this stack before? what broke?"

After 3 months of data, the system meaningfully avoids mistakes it's already made. That wasn't in the original design. I added it after watching the system make the same rate-limit mistake on three targets in a row. On the fourth target, it slowed down automatically. That was the moment I stopped thinking of this as a script.

Three tables do most of the work:

Table	Purpose
`knowledge_base`	Semantic embeddings of past findings and techniques
`false_positive_signatures`	Known patterns that look like vulnerabilities but aren't
`failure_patterns`	Recovery strategies for different error types

The first month is calibration, not hunting. The RAG database starts empty. Every finding is evaluated without prior context, so the false positive rate is higher than steady state. By week 2, the system starts filtering patterns it's already rejected. By week 4, confidence scores mean something specific to your programs and testing patterns. Skip the calibration month and month two is chaos.

The Human-in-the-Loop Gate

Full automation for security research is wrong.

Not in a theoretical sense. Wrong in a "your reputation will be destroyed" sense.

Consider two hypothetical researchers. Researcher A submits 200 reports, 50 accepted (25% rate). Researcher B submits 50, 40 accepted (80% rate). Programs trust Researcher B. They triage faster. They pay more. The acceptance rate compounds over months.

Finding cleared by Validation Agent (confidence 0.85+)
    ↓
Human review queue (checked once per day)
    ↓
[APPROVE] → Reporter Agent formats + submits
[DISMISS] → Logged with reason, updates false positive signatures
[INVESTIGATE] → Flagged for manual testing
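
The flow above can be sketched as a small router. The 0.85 threshold and the three decisions come from the diagram; the returned step names are placeholders for whatever the real queue calls them.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    DISMISS = "dismiss"
    INVESTIGATE = "investigate"

CONFIDENCE_FLOOR = 0.85  # threshold taken from the flow above

def enters_review_queue(confidence):
    """Only findings cleared by validation reach the human queue."""
    return confidence >= CONFIDENCE_FLOOR

def route(decision, reason=""):
    """Map a human decision to the next step; nothing submits without APPROVE."""
    if decision is Decision.APPROVE:
        return "format_and_submit"
    if decision is Decision.DISMISS:
        if not reason:
            raise ValueError("a dismissal needs a reason to update the signatures")
        return "update_false_positive_signatures"
    return "manual_testing"
```

Requiring a reason on dismissal is a design choice in this sketch: it keeps the false positive signatures useful rather than filling them with unexplained rejections.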

Every submission has been through my eyes before it goes to a program. Non-negotiable.

What the system will never do:

Submit reports without human approval
Test targets outside registered bug bounty programs
Test out-of-scope domains (hard-blocked before execution, not just warned)
Exaggerate severity for higher bounties
Auto-resume after a ban without human authorization

After switching to mandatory human review: acceptance rate went above 80%. Programs respond faster because trust is established. Evidence packages prevent disputes.

The slow-down is worth it. 5 high-quality reports per week beats 50 that damage your reputation.

Validation: Why Detection Isn't Exploitation

The validation layer is what makes or breaks a bug bounty automation system. Most systems skip it. That's why most systems produce garbage.

A scanner finding your payload in a response proves nothing. The payload might appear in an error message that's never rendered. It might appear HTML-escaped in an attribute. It might appear on a WAF block page explaining what was filtered. Every one of those looks like a vulnerability to a pattern matcher. None of them are.

Response diff analysis is the fix. Instead of asking "is my payload in the response?" the validation agent asks "does the response differ from baseline in an exploitable way?"

Pattern	Why It's a False Positive
Payload in error message	Error messages aren't rendered as HTML
Payload in JSON response	JSON with correct Content-Type isn't executed
HTML-escaped `<script>` in body	Rendered as text, not executed
403 response with payload	WAF blocked it, not vulnerable
Reflected in `src=""` attribute	Often non-exploitable context
SQL syntax error on invalid input	Input validation, not injection
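
A rough sketch of that baseline comparison, using the standard-library difflib and html modules. It is a pre-filter under simple assumptions (line-based diff, HTML responses only), not the custom diff library listed in the stack, and the function names are hypothetical.

```python
import difflib
import html

def new_lines(baseline, candidate):
    """Lines present in the candidate response but not in the baseline."""
    diff = difflib.ndiff(baseline.splitlines(), candidate.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def looks_exploitable(baseline, candidate, payload,
                      status=200, content_type="text/html"):
    """Heuristic only: a survivor still needs browser confirmation."""
    if status == 403:                        # WAF block page, not a vulnerability
        return False
    if "html" not in content_type.lower():   # e.g. JSON is not rendered as HTML
        return False
    escaped = html.escape(payload)
    if escaped == payload:                   # nothing to escape, cannot tell raw from safe
        return False
    # Flag only a raw (unescaped) reflection that is new relative to baseline.
    return any(payload in line for line in new_lines(baseline, candidate))
```

An escaped reflection never contains the raw payload string, so it drops out without a special case.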

For XSS specifically: regex can't tell you if JavaScript executes. Browser validation via Playwright loads the target page, injects a marker that fires if code runs, and checks whether that marker triggers. If alert() fires, XSS is confirmed. If not — regardless of how "vulnerable" the response looks — the finding gets rejected.
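
One way to express that check, assuming `page` is a Playwright sync-API Page (`page.on("dialog", ...)`, `page.goto`, `dialog.message`, and `dialog.dismiss` are part of that API). The function name is hypothetical, and it only catches dialogs that open during the initial load.

```python
def xss_confirmed(page, url):
    """Return True only if the page opens a JavaScript dialog while loading.

    `page` is expected to behave like a Playwright sync-API Page.
    """
    fired = []

    def on_dialog(dialog):
        fired.append(dialog.message)  # keep the message as evidence
        dialog.dismiss()              # close it so navigation does not hang

    page.on("dialog", on_dialog)
    page.goto(url)
    return bool(fired)
```

With Playwright installed, `page` would come from `sync_playwright()`, a launched browser, and `browser.new_page()`. A payload that fires on a timer or on user interaction would need an explicit wait or simulated input on top of this.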

The false positive signatures database stores every pattern the system has learned to dismiss. Every rejected finding adds to it. After 3 months, it filters hundreds of known-harmless patterns before they reach the review queue.

Before validation: ~40 findings per scan, 2-3 valid (90%+ false positive rate).
After validation: ~40 detections, 8-12 survive for human review, 5-7 valid (~40% false positive rate at review stage).

Still not perfect. But humans now review 12 findings instead of 40 — and 60% of what they see is real.

Failure Recovery: The 6 Categories

My testing agent hit a rate limit at 2 AM. It retried immediately. Got rate limited again. Retried. Rate limited. Retried faster. By morning, I was IP-banned from the target's entire infrastructure.

That specific failure taught me that error handling in security automation isn't optional. Generic retry loops make things worse. Every error needs classification first.

Category	Detection Pattern	Recovery Strategy
Rate Limit	HTTP 429, "too many requests"	Exponential backoff (2x multiplier, 1hr max)
Ban Detected	CAPTCHA, IP block, consecutive 403s	Immediate halt + human alert
Auth Error	401, expired token, invalid session	Credential refresh + retry (3 max)
Timeout	No response >30 seconds	Reduce parallelism + extend timeout
Scope Violation	Testing out-of-scope domain	Remove from queue + blacklist
False Positive	Validation rejection	Log pattern + update signatures
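
A sketch of classifying before retrying, covering the four categories that can be read off an HTTP response. The function name and the threshold of three consecutive 403s are assumptions; scope violations and false positives are decided elsewhere in the pipeline, so they are left out here.

```python
def classify(status, body="", consecutive_403s=0, elapsed_s=0.0):
    """Return one error category, checking the most dangerous case first."""
    text = body.lower()
    if "captcha" in text or consecutive_403s >= 3:
        return "ban_detected"      # halt everything and alert a human
    if status == 429 or "too many requests" in text:
        return "rate_limit"        # exponential backoff
    if status == 401:
        return "auth_error"        # refresh credentials, retry up to 3 times
    if elapsed_s > 30:
        return "timeout"           # reduce parallelism, extend timeout
    return None                    # not an error this classifier knows about
```

The ban check sits first on purpose: a 429 that arrives alongside a CAPTCHA page should stop the run, not schedule another retry.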

Exponential backoff for rate limits: 30s, 60s, 120s, 240s, capped at 1 hour. The ceiling matters. HackerOne resets rate limits every 15 minutes — waiting 4 hours wastes time.
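
The schedule is a one-line formula. A minimal sketch, with a hypothetical function name and the 30-second base, 2x multiplier, and 1-hour cap taken from the text:

```python
def backoff_delay(attempt, base_s=30, multiplier=2, cap_s=3600):
    """Seconds to wait before retry number `attempt` (0-based)."""
    return min(base_s * multiplier ** attempt, cap_s)

print([backoff_delay(n) for n in range(8)])
# [30, 60, 120, 240, 480, 960, 1920, 3600]
```

The eighth retry would be 3,840 seconds uncapped, so the ceiling takes effect there.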

Ban detection has the highest priority and runs before rate limit detection. When it triggers, all agents stop immediately, a human alert fires, and session state is saved for investigation. Never auto-resume. A human must explicitly authorize continuation.

Escalation threshold: same error category 5+ times within 5 minutes triggers human intervention. First-occurrence rate limits and single timeouts never escalate.
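
The threshold can be sketched as a sliding window per error category. The class and method names are hypothetical; the 5-events-in-5-minutes rule is from the text.

```python
from collections import defaultdict, deque

class Escalator:
    """Escalate when one error category repeats 5+ times inside 5 minutes."""

    def __init__(self, threshold=5, window_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.events = defaultdict(deque)  # category -> recent timestamps

    def record(self, category, now):
        """Record an error at time `now` (seconds); return True to escalate."""
        recent = self.events[category]
        recent.append(now)
        while recent and now - recent[0] > self.window_s:
            recent.popleft()              # drop events older than the window
        return len(recent) >= self.threshold
```

Because each category has its own window, one timeout plus one rate limit never adds up to an escalation.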

Before categorized recovery: ~30% of scans interrupted by unhandled errors, bans monthly.
After: ~5% need human intervention, zero bans in 6 months, 200+ learned error signatures.

Multi-Platform Integration

HackerOne needs severity ratings with their specific weakness taxonomy. Intigriti wants different field names and inline severity justification. Bugcrowd has unique bounty table structures. Without a unified model, you end up maintaining three separate report generators for the same findings.

The approach that works: one internal findings model with three platform-specific formatters. Every agent works with the unified model. Platform awareness lives only at two boundaries — ingestion (pulling scope from platforms) and submission (sending reports to platforms). Everything between is platform-agnostic.

interface Finding {
  id: string;
  title: string;
  description: string;
  vulnerabilityType: VulnType;
  cvssVector: string;        // Full CVSS v3.1 vector
  cvssScore: number;         // Calculated from vector
  severity: 'critical' | 'high' | 'medium' | 'low' | 'informational';
  poc: { steps: string[]; curl?: string; script?: string; };
  evidence: { screenshots: string[]; requestResponse: string[]; hashes: string[]; };
  confidence: number;
  status: FindingStatus;
}

Each platform formatter implements the same interface: format, validate, submit. They transform the unified Finding into what each platform expects. HackerOne maps vulnerability types to their weakness taxonomy IDs. Intigriti uses different field names. Bugcrowd requires bounty table entries mapped from severity.
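
In Python terms, the shared format/validate/submit contract might look like the sketch below. The `Finding` class is a trimmed mirror of the interface above, and `DryRunFormatter` with its field names is a hypothetical placeholder, not any platform's real schema.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Finding:
    """Trimmed Python mirror of the unified Finding model."""
    id: str
    title: str
    severity: str      # 'critical' | 'high' | 'medium' | 'low' | 'informational'
    cvss_vector: str

class PlatformFormatter(Protocol):
    """The three operations every platform formatter provides."""
    def format(self, finding: Finding) -> dict: ...
    def validate(self, payload: dict) -> list: ...
    def submit(self, payload: dict) -> str: ...

class DryRunFormatter:
    """Placeholder formatter: builds and checks a payload, sends nothing."""

    def format(self, finding):
        return {"title": finding.title,
                "severity_rating": finding.severity,
                "cvss": finding.cvss_vector}

    def validate(self, payload):
        # Names of fields that are present but empty.
        return [key for key, value in payload.items() if not value]

    def submit(self, payload):
        problems = self.validate(payload)
        if problems:
            raise ValueError(f"missing fields: {problems}")
        return "dry-run: nothing sent"
```

Only the formatters know platform field names, so agents never import anything platform-specific.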

The Budget Manager tracks API rate quotas per platform. Before every API call, agents check canRead() or canWrite(). If exhausted, the request queues until quota resets.
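
A sketch of the quota check as a fixed-window counter. The Python method names `can_read` and `can_write` stand in for canRead() and canWrite(); the limits and the window length are placeholders, since real quotas differ per platform.

```python
import time

class BudgetManager:
    """Fixed-window quota counter for one platform (limits are placeholders)."""

    def __init__(self, read_limit, write_limit, window_s=900, clock=time.monotonic):
        self.limits = {"read": read_limit, "write": write_limit}
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.used = {"read": 0, "write": 0}

    def _roll_window(self):
        """Reset the counters once the current window has elapsed."""
        if self.clock() - self.window_start >= self.window_s:
            self.window_start = self.clock()
            self.used = {"read": 0, "write": 0}

    def _take(self, kind):
        self._roll_window()
        if self.used[kind] >= self.limits[kind]:
            return False          # caller should queue the request
        self.used[kind] += 1      # a granted check consumes one unit of quota
        return True

    def can_read(self):
        return self._take("read")

    def can_write(self):
        return self._take("write")
```

In this sketch a successful check also reserves the quota, so two agents asking at the same moment cannot both spend the last request.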

A first-mover priority system monitors all three platforms for programs launched in the last 24 hours. New programs get immediate passive recon. Active testing starts after a 2-4 hour delay for scope to stabilize. Early submissions on new programs have higher acceptance rates — less competition, more unreported surface area.

Tools and Stack

Orchestration: Claude Opus (orchestrator), Claude Haiku (testing agents)
Recon: httpx, subfinder, amass, crt.sh API
Testing: Custom Python agents per vulnerability class, Playwright for JS analysis
Validation: Docker sandboxed execution, custom response diff library
Storage: SQLite with sqlite-vec for semantic search
Platform integration: HackerOne API, Intigriti API, Bugcrowd API
Infrastructure: VPS ($40/mo) — not serverless, you need persistent state. See my Python agent deployment guide for setup
Total monthly cost: ~$180 ($40 VPS + ~$140 Claude API)

What I'd Do Differently

Start with the Validation Agent, not the scanner. The scanner is interesting. The validation layer is what actually matters. Build it first.

Cap concurrent agents at 4 from day one. Started with 10. Got IP-banned from 3 programs in two weeks.

Build the human review queue before anything else. The moment you can submit without a gate is the moment you will. Build the gate first.

Accept that it won't make you rich quickly. This system makes you roughly 3.5x more effective. That's the actual value proposition.

Current Results (3 Months In)

12 active programs being monitored
~30 findings surfaced for human review per week
~4-6 submitted after review
0 false positives submitted
~$180/month running cost
~3.5x throughput increase vs. manual research

Building something similar? The hardest part is the validation layer. Start there — everything else is just plumbing.

The multi-agent patterns behind this system are in the Battle-Tested Builder Kit — CLAUDE.md templates, agent routing rules, and verification gates you can drop into your own projects.

Claude Code vs Cursor vs GitHub Copilot: Which One Actually Ships Better Production Code?

Chudi Nnorukam — Fri, 10 Apr 2026 23:54:36 +0000

Originally published at chudi.dev

I spent three months building a trading bot in production. Real money on the line. 4,000 lines of Python across 22 files. WebSocket feeds from Polymarket, Binance price data, Chainlink oracles, SQLite databases, and a systemd deployment pipeline.

During those three months, I used Claude Code for 95% of the work. But I also tested Cursor and GitHub Copilot on the exact same codebase to understand where each tool actually excels.

All three tools are good. But they solve completely different problems.

Claude Code shipped the bot. Cursor could have shipped it faster if I sat at the keyboard the whole time. Copilot could autocomplete most of it if I knew exactly what I wanted to write.

I paid for all three tools myself. Claude Code costs me $200/month, Cursor is $20/month, Copilot is $19/month. I have skin in the game to pick the right tool.

Here's what nobody tells you: picking the wrong tool doesn't just slow you down. It trains you to work differently. I watched a friend spend six months with Copilot autocomplete, then switch to Claude Code and feel completely lost because he'd built a mental model around "I drive, the tool types." Claude Code requires the opposite mental model. The tool drives. You supervise.

That inversion is where most developers get tripped up. They grab whatever their team already uses, force it to do things it wasn't designed for, and blame the tool when it underperforms. Meanwhile the engineer down the hall using the right tool for the right workflow ships twice as fast.

What Does Each Tool Actually Do Best?

Claude Code is an autonomous agent: it reads your codebase, writes code, runs tests, and fixes failures without you in the loop. Cursor is an IDE built for inline editing speed. GitHub Copilot is autocomplete. Each excels at a different layer of the coding workflow.

Best for production systems with real money: Claude Code (prevents costly mistakes via instruction system)

Best for code editing speed: Cursor (2-3x faster than Claude Code's terminal workflow)

Best for pure autocomplete: GitHub Copilot (trained on GitHub, knows all patterns)

Feature	Claude Code	Cursor	GitHub Copilot
Multi-file editing	Autonomous (20+ files)	Manual per file	Manual per file
Cost/month	$100-200 (Max plan)	$20 (Pro)	$10-19
Best for	Architecture, refactors	Inline editing	Autocomplete
Context window	200K tokens	128K tokens	Limited
Terminal integration	Native CLI	IDE plugin	IDE plugin
Autonomous execution	Yes	Partial	No

How Did I Test These? Same Codebase, Same Tasks, Real Metrics

I used the same 4,000-line Python trading bot as the test environment for all three tools. Same five tasks, same codebase, same definition of done: all tests pass, no LSP errors, feature works in production. I timed every task from "start" to "verified complete."

I didn't run contrived benchmarks. I used each tool to solve actual problems in a real production trading bot.

The codebase:

4,000 lines across 22 Python files
WebSocket integrations, asyncio loops, SQLite database layer
Real external dependencies (py-clob-client, Binance SDK, web3.py, Chainlink feeds)
87 unit tests

The tasks:

Add a new signal source (Chainlink oracle, 150 lines)
Refactor position tracking across 5 files (200 lines changed)
Fix a bug in accumulator state machine (10 lines, wrong location)
Deploy and verify on VPS via SSH
Write a test file from scratch (80 lines)

How I measured:

Time from "start" to "all tests pass"
Number of iterations before correct solution
Whether the tool caught type errors before runtime
Whether the tool understood cross-file dependencies

Why Does Claude Code Win for Production Systems?

Claude Code wins for production systems because it's the only tool that understands your entire codebase, enforces your architecture rules via CLAUDE.md, runs tests autonomously, and catches type errors before runtime. For multi-file work with real money on the line, that autonomy is worth $200/month.

Claude Code is not a copilot. It's an agent that can explore your codebase, understand dependencies, write code, run tests, and fix failures without you touching the keyboard.

Multi-file autonomy

I said "add a Chainlink oracle feed to the signal bot." Claude Code:

Explored the codebase structure (Glob, Grep, lsp_workspace_symbols)
Read existing signal sources to match patterns
Created the new oracle module
Wired it into signal_bot.py
Added it to config.py with safe defaults
Wrote tests
Ran the test suite
Fixed failures without asking

150 lines written. Zero follow-ups needed. 45 minutes elapsed. The run ended with every test passing, and the failures along the way were fixed without my input.

Cursor and Copilot could not do this. They would write individual files, and I would have to wire them together, run tests, and tell them what broke. This is the core difference that makes Claude Code a force multiplier for large refactors and architecture work.

The instruction system (CLAUDE.md)

I maintain a project instructions file that Claude Code reads on startup:

- Architecture: "All database operations use async context managers"
- Naming: "Signal modules are signal_<name>.py"
- Error handling: "All state machine transitions log to SQLite"
- Deployment: "Never use sed -i on .env. Always backup first"
- Testing: "Run pytest before deployment. Check lsp_diagnostics for type errors"

Claude Code follows these instructions. Cursor and Copilot don't even know they exist.

Example: I had a bug where config.py loaded .env via load_dotenv() on every import. This caused all instances to read the wrong config. The fix was in my instruction file: "Never use load_dotenv(). Pass ENV_PATH explicitly." Claude Code caught this when reviewing other code. Cursor would not.

Type checking and diagnostics

Claude Code runs LSP diagnostics and pytest before declaring victory. It catches 80% of runtime errors at write time.

# Claude Code ran lsp_diagnostics after editing position_executor.py
# Output: error at line 47: "position_id" is not defined
# Claude Code read the file, found the typo, fixed it
# Never got to runtime

Cursor has inline type hints but doesn't proactively check. Copilot has no type awareness. This automated verification is critical for production systems. I built mine with a two-gate verification system that Claude Code enforces via the instruction system.

Where Claude Code falls short

Terminal-only workflow. Claude Code is a terminal agent. For single-file edits, Cursor is 10x faster. Editing a line in Cursor takes 2 seconds. Editing via Claude Code takes 20 seconds (read, understand, edit, verify, diagnostics).

Expensive. $200/month on the Max plan. For small projects, it's not worth it. For my use case (22 files, multi-file refactors, real money), Claude Code paid for itself by preventing 2 bugs that would have cost $50+ each. If you're wondering if it's worth the cost, check how I built my trading bot. That project shows the real ROI.

Can go off the rails. Agents can hallucinate. I've had Claude Code delete the wrong file, write tests that don't test anything, and suggest changes that break other parts. The safety valve is always: "Did you run tests? Are all diagnostics clean?" This is why I built my AI code verification system: two gates before every deploy.

Learning curve. You need to understand prompts, git, bash, and context management. If you're building ADHD-friendly workflows, Claude Code's instruction system is a game-changer: see how I use it for focused work. Cursor and Copilot work inside the editor without ceremony.

Why Is Cursor the Fastest for Editing?

Cursor beats every other tool on single-file edit speed. Highlight code, describe the change in chat, accept: 5 seconds versus 25 seconds in Claude Code's terminal workflow. If you spend 4 hours a day editing existing files, Cursor saves you 3+ hours per week. That's the one thing it does better than everything else.

Cursor is VS Code with AI built in: tab autocomplete trained on your codebase, inline chat, Composer for multi-file editing, and @codebase context that understands your entire repo.

Inline editing speed

I timed myself editing the same file in both tools.

File: position_executor.py (200 lines). Task: "Add a size calculation that scales with volatility."

Claude Code: Read file, understand context, edit via Edit tool, verify, run diagnostics = 25 seconds
Cursor: Highlight region, type in chat, accept changes = 5 seconds

If you spend 4 hours a day editing code, Cursor saves you 3+ hours per week.

@codebase understanding

Cursor's @codebase context is genuinely good. I asked "Where are all the places we parse market prices?" and it found all three locations across different files. All correct, all in one search.

Claude Code can do this via lsp_workspace_symbols + Grep, but it's more manual.

Where Cursor falls short

Context limits. I hit the limit trying to refactor the entire signal pipeline (22 files, 4,000 lines). It could only see 15 files at once. Claude Code's larger context window (200K tokens in the comparison table above, against Cursor's 128K) loaded the entire codebase. See how I manage context for large projects.

No autonomy. Cursor requires you to drive each file. I asked it to add an oracle feed. It wrote the oracle module perfectly. But it didn't wire it into signal_bot.py, didn't update config.py, didn't write tests. I had to ask four more times.

No instruction system. Cursor has no equivalent to CLAUDE.md. You can't set project-wide rules like "always backup .env before editing." It has no memory of your patterns across sessions. See how I use instruction files for focused work.

When Should You Just Use GitHub Copilot?

Use GitHub Copilot when your primary workflow is writing new boilerplate in languages you already know well. It's the cheapest option ($10-19/month), works in every IDE including Vim and PyCharm, and autocompletes class definitions, imports, and repetitive patterns at 5x your typing speed. Don't expect it to understand your architecture.

Copilot is the narrowest tool: autocomplete. You type, it predicts the next line. And it's genuinely good at that one thing.

I opened a fresh file and typed class PositionExecutor: with def __init__. Copilot predicted the next 8 lines perfectly. Instance variables, type hints, docstring. Hit Tab, done.

For boilerplate you've written 100 times, Copilot is 5x faster than typing.

The trade-off: Copilot has no multi-file awareness. It doesn't know your architecture. It doesn't run tests. It doesn't know if the code it autocompleted is correct.

# Copilot autocompleted:
position_id = order_response['id']  # Fails: 'id' not in order_response

# Should be:
position_id = order_response['tokenId']  # Correct

Copilot doesn't know the difference. It just saw similar patterns on GitHub.

How Do They Compare Head-to-Head?

The table below covers every meaningful dimension: autonomy, cost, speed, context limits, and learning curve. Claude Code dominates multi-file work. Cursor dominates single-file speed. Copilot dominates cost and breadth. No single tool wins every category.

Feature	Claude Code	Cursor	GitHub Copilot
Autocomplete	No	Yes (trained on your codebase)	Yes (trained on GitHub)
Chat with code	Yes (terminal)	Yes (inline)	No
Multi-file understanding	Yes (LSP + Grep)	Partial (@codebase limited)	No
Multi-file editing	Yes (autonomous)	Partial (Composer)	No
Autonomous refactoring	Yes	No	No
Testing integration	Yes (runs pytest)	No (syntax only)	No
Type checking	Yes (LSP diagnostics)	Partial (IDE background)	No (IDE only)
Instruction system	Yes (CLAUDE.md)	No	No
IDE native	No (terminal)	Yes (VS Code)	Yes (all IDEs)
Single-file edit speed	25s	5s	2s (autocomplete)
Multi-file refactor speed	45 min (autonomous)	2-3 hours (manual)	Not feasible
Cost	$200/month	$20/month	$19/month
Learning curve	High (shell, LSP, git)	Low (IDE, chat)	None (autocomplete)

Which Tool Should You Pick?

Pick Claude Code for production systems with multi-file complexity. Pick Cursor if you edit existing code all day and want IDE-native speed. Pick Copilot if autocomplete is enough and you need the cheapest option across all your IDEs. Most serious developers end up using two or all three.

Or use all three. They don't conflict. Cursor and Claude Code live in different workflows (IDE vs terminal). Copilot enhances both.

Use Cursor for inline editing (fastest for single files)
Use Claude Code for multi-file refactors and testing
Use Copilot for autocompleting boilerplate

What Does This Actually Cost?

All three tools together cost $239/month ($2,868/year). That sounds like a lot until you price your time. Claude Code at $200/month prevented two bugs in my trading bot that would have cost $200+ in lost capital. Cursor at $20/month saves 3-4 hours per week. The math works at senior engineer rates.

Tool	Price	Per Year	Use Case	ROI
Claude Code Max Plan	$200/month	$2,400	Large codebases, autonomous work, testing	Prevents 2-3 bugs per month worth $50+ each
Cursor Pro	$20/month	$240	Single-file editing velocity, IDE native	Saves 3-4 hours per week of keyboard time
GitHub Copilot	$19/month	$228	Boilerplate autocomplete, all IDEs	Saves 1-2 hours per week on routine typing
Total	$239/month	$2,868	All three tools together	Best coverage for all workflows
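
The totals in the table are easy to check, and the same arithmetic gives a break-even point in hours. The $100 hourly rate below is an assumption for illustration, not a figure from this post.

```python
monthly = {"Claude Code Max": 200, "Cursor Pro": 20, "GitHub Copilot": 19}

total_month = sum(monthly.values())
total_year = total_month * 12
print(total_month, total_year)  # 239 2868

def break_even_hours(cost_per_month, hourly_rate):
    """Hours of saved time per month needed to cover the subscription cost."""
    return cost_per_month / hourly_rate

print(round(break_even_hours(total_month, 100), 2))  # 2.39
```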

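For my trading bot project, Claude Code cost $800 over 4 months. It prevented bugs that would have cost me $200+ in lost capital. On prevented losses alone that recovers about a quarter of the spend, so the case for the cost rests on the hours saved by autonomous multi-file work, not on bug prevention by itself.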

For a smaller project (one person, 500 lines), Claude Code is not worth it. Cursor + Copilot at $39/month is the sweet spot.

The Real Difference: Can This Tool Ship Without You?

The only question that matters for production systems: if you step away for an hour, can the tool keep shipping? Claude Code can. Cursor and Copilot cannot. That's the boundary that determines which tool fits your project.

Claude Code: Yes. Full codebase understanding, tests, deployment verification, post-deploy error checking.

Cursor: Partially. It can edit files fast, but you drive the sequence. You run tests. You deploy.

Copilot: No. It's autocomplete. You write the code, it guesses the next line.

For a trading bot with real money on the line, Claude Code's ability to understand the entire system, write tests, and catch errors before deployment is worth the cost.

For editing speed and IDE-native workflow, Cursor wins.

For pure typing speed, Copilot's autocomplete wins.

My workflow today:

Claude Code for new features, multi-file refactors, testing
Cursor for quick edits in the IDE (when I know exactly what to change)
Copilot for autocompleting boilerplate (when I don't want to type import statements)

All three earn their cost.

I Built a 36,000-Line Production Trading Bot With Claude Code

Chudi Nnorukam — Fri, 10 Apr 2026 23:53:59 +0000

Originally published at chudi.dev

I finished the first version of Polyphemus in 6 weeks. Fully autonomous Polymarket trading bot. 4,000+ lines, real money on the line. Couldn't have shipped it without Claude Code. I also wasted $340 in one month using it wrong.

Four months later, the same codebase is 36,000+ lines. Two live instances running pair arbitrage on BTC/ETH/SOL/XRP. Strategy evolved from directional to market-neutral. Eight silent production bugs found and killed before they touched live capital. All caught by a shadow-first deployment gate I built after month two.

The five principles that got it to 4,000 lines are the same ones that got it to 36,000 without entropy. Here's the case study, updated April 2026.

Why Does Claude Get Dumber as Your Project Gets Bigger?

The bigger your codebase grows, the faster Claude's context window fills up. Each session becomes less useful because the window is full of stale context instead of relevant code. By week three of Polyphemus, I was spending 20 minutes re-explaining context Claude had already "seen." By month one, my token bill hit $340. The problem isn't Claude. It's the missing system around it.

Every Claude Code guide starts with CLAUDE.md tips. Not wrong, just backwards.

The first thing I had to understand was not "how do I write better prompts." It was: why does Claude get dumber as my project gets bigger?

The answer is the context window. Every session loads your project into Claude's working memory. As your codebase grows, that memory fills up faster. By week three, I was re-explaining decisions Claude had made with me two days earlier. Architecture choices, naming conventions, API patterns: all gone. I was paying for Claude to relearn the project I'd already taught it.

That's not a Claude problem. That's a system problem. And the symptoms compound fast: re-explaining context is demoralizing, it produces worse outputs because you're summarizing instead of being precise, and eventually you stop correcting Claude because it feels pointless. You start accepting mediocre outputs. You start doing the "hard parts" manually. You've turned an AI assistant into an expensive autocomplete. By month one, I was close to giving up on the whole approach.

Here are the five things that fixed it.

Principle 1: Context is a Resource. Manage it Like One.

Most developers treat Claude's context window like unlimited RAM: load everything, let it sort out what matters. That approach spent 58% of each session's tokens on context it didn't need and produced hallucinations on files Claude "remembered" but didn't actually have in scope. The fix is three tiers: always-loaded project identity (under 500 tokens), per-session task file (under 1,000 tokens), and explicit on-demand file loading. Nothing else.

Tiered context loading in practice:

Tier 1. Always loaded (under 500 tokens). CLAUDE.md at project root. What the project is, file structure, conventions. Nothing else. The map, not the territory.

Tier 2. Per-session (under 1,000 tokens). A CURRENT_TASK.md file. What you're building today, what files are involved, what "done" looks like.

Tier 3. On demand. Specific files, loaded explicitly. "Read src/core/kelly.py before we start."

Result: average session token usage dropped from ~10,000 to ~4,200 tokens. 58% reduction from one workflow change.

The rule that makes Tier 3 work: never reference a file by name without loading it first. "Update the execution module" produces hallucination. "Read src/execution/orders.py, then update the retry logic" produces accurate output.
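
To keep Tier 1 and Tier 2 honest, a small check can flag files that outgrow their budget. This is a sketch: the 4-characters-per-token estimate is a crude assumption rather than a real tokenizer, and the function names are hypothetical.

```python
# Token budgets from the tiers above.
TIER_BUDGETS = {"CLAUDE.md": 500, "CURRENT_TASK.md": 1000}

def estimate_tokens(text):
    """Crude estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4 + 1

def over_budget(files):
    """files maps filename -> contents; return files exceeding their tier budget."""
    report = {}
    for name, text in files.items():
        budget = TIER_BUDGETS.get(name)
        if budget is not None and estimate_tokens(text) > budget:
            report[name] = estimate_tokens(text)
    return report
```

Running it before a session gives an early warning that CLAUDE.md is turning from a map into the territory.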

Principle 2: Claude's Built-in Memory is Better Than Manual Note-Taking

Claude Code has two memory systems: the CLAUDE.md file you write by hand, and Auto Memory, which Claude writes itself based on corrections you make. Most developers only use the first. Using both cuts the manual overhead of session management by more than half and produces more accurate recall than notes you wrote yourself.

I wasted two weeks maintaining a sprawling set of markdown notes before I discovered this. I was carefully updating files that Claude was already tracking more accurately through auto memory.

What auto memory doesn't do: strategic decisions. If you've chosen PostgreSQL over SQLite for a reason, write that in CLAUDE.md. Auto memory captures patterns. CLAUDE.md captures architecture.

The CLAUDE.md that actually worked for Polyphemus:

# Polyphemus — Claude Context

## What this is
Autonomous Polymarket trading bot. Real money. Kelly Criterion sizing. 
Never lets an exception stop the main loop.

## Hard rules
- Never hardcode API keys. Doppler only.
- All amounts in USDC, not cents. One violation cost a real trade.
- Log every trade decision with rationale BEFORE executing.
- MAX_POSITION_SIZE is a ceiling, not a suggestion.

## What we are NOT doing
- No async. Sync is predictable.
- No ML models. Signal threshold is a float.
- No framework for the trading loop. Too much magic.

300 tokens. That's it. Short CLAUDE.md, accurate auto memory, clean context.

Principle 3: Plan Mode Before You Write a Single Line

Plan mode (/plan) lets Claude research your codebase and propose an approach without making any changes. You review the plan, redirect if needed, then approve. On any task touching more than two files, this single step eliminates the most expensive class of Claude mistake: confident, multi-file output that conflicts with existing architecture.

Without plan mode, Claude wrote 200 lines of code conflicting with an architectural decision buried in a different file. Confident. Wrong. Two hours lost.

With plan mode on anything touching more than two files:

Me: /plan Add circuit breaker to execution module.
    Pause trading after 3 consecutive losses.

Claude: [researches without touching anything]
        Proposed approach: [plan]
        Files: src/execution/orders.py, src/core/state.py
        I noticed MAX_LOSS_DAILY in src/core/config.py —
        should the circuit breaker integrate with that?

Me: Yes, but use config module, not state.py.

Claude: Understood. Implementing now...

Claude caught an integration point I hadn't mentioned. I redirected before any code was written. Plan mode costs nothing and consistently saves hours.

Principle 4: Two Gates, Not One

Two-gate verification means every Claude output clears an automated check (type checks, linting, tests in under 30 seconds) before it gets a human review pass using a fixed 6-question checklist. Before this system, 1 in 6 Claude outputs reached production with an error. After: 1 in 40. On a trading bot, that gap is the difference between an incident log and a boring afternoon.

Gate 1 is automated. A bash script: type checks, linting, tests. 30 seconds. Catches ~60% of errors.

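#!/bin/bash
# Gate 1: run each check in order and exit non-zero at the first failure.
python -m mypy . --ignore-missing-imports && echo "✓ Types" || exit 1  # static type check
python -m ruff check . && echo "✓ Lint" || exit 1                      # lint
python -m pytest tests/ -q && echo "✓ Tests" || exit 1                 # unit tests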

If Gate 1 fails, paste the error back: "Fix only what's causing this error. Nothing else." That last sentence matters. Without it, Claude fixes the error and refactors three other things.

Gate 2 is a 6-question checklist. Five minutes, manual, non-negotiable:

1. Does this do exactly what I asked — not more?
2. Are external API calls using the correct endpoints?
3. Is error handling present on every async/IO operation?
4. Are there hardcoded values that should be env vars?
5. Does this break anything that was already working?
6. Can I explain every line if someone asks me tomorrow?

Question 6 catches the most issues. If I can't explain a line, I don't ship it.

Principle 5: Treat Compaction Like a Power Outage. Plan for It.

When Claude's context window fills, it compacts: nuance gets discarded, recent decisions disappear, and the next response starts from a degraded state. The fix is a pre-compaction ritual at the end of every meaningful session. One prompt to update CURRENT_TASK.md, record new decisions, and write a 2-sentence handoff note for the next Claude instance. Recovery time: 90 seconds.

At the end of every meaningful session, one prompt:

We're wrapping up. Please:
1. Update CURRENT_TASK.md with current state
2. Add new decisions to CLAUDE.md's decisions section
3. Write a 2-sentence next-session starter — what the next 
   instance of you needs to know to resume immediately

That third item is the key. Claude writing handoff notes for Claude produces better handoffs than I can write myself. When a new session starts:

Read CLAUDE.md and the "Next session" section of CURRENT_TASK.md.
Confirm your understanding before we continue.

90 seconds. Full speed.

The Numbers, Updated April 2026

These aren't projections. This is the actual state of a production system that started at 4,247 lines in December 2025 and hit 36,096 lines four months later, running pair arbitrage on BTC/ETH/SOL/XRP. Every number below is from a live system, not a benchmark.

Metric	Before System	After System
Lines of code	4,247 (Dec 2025)	36,096 (Apr 2026)
Avg session tokens	~10,000	~4,200
API cost/month	$136 (month 1, unoptimized: ~$340)	$136 ongoing
Error rate to production	1 per 6 outputs	1 per 40 outputs
Silent bugs caught by shadow gate	0	8 (none hit live capital)
Test coverage	41%	73%
Uptime since December	N/A	99.2%
Claude Code sessions	~180 (Dec)	400+ (Apr)

The system rather than the prompts made the difference. Claude Code is a force multiplier. Without a system, it's an expensive way to ship buggy code faster. For a deeper look at verification workflows, see my post on evidence-based AI code verification.

The 6th principle I'd add today: shadow-first before every live deploy. Run a dry-run instance in parallel. Collect evidence. Gate on n_completed >= 50 before promoting. In April alone, that gate caught a bug where set("BTC") was returning {'B','T','C'} instead of {'BTC'}. The bot would have traded the wrong assets live. Eight bugs like that. Zero P&L damage.
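The bug is easy to reproduce: Python's set() iterates over whatever it is given, and a string iterates character by character. Below is a minimal sketch of the failure, a guard against it, and the promotion gate. parse_assets and ready_to_promote are hypothetical helper names, not the bot's actual code; the threshold of 50 is the one quoted above.

```python
def parse_assets(value):
    """Return a set of asset symbols from one symbol or an iterable of symbols."""
    if isinstance(value, str):
        # Wrap a lone string so it counts as one symbol, not as characters.
        return {value}
    return set(value)


def ready_to_promote(n_completed, minimum=50):
    """Shadow gate: promote only once enough dry-run trades have completed."""
    return n_completed >= minimum


buggy = set("BTC")           # iterates the string: three one-letter "assets"
fixed = parse_assets("BTC")  # one asset
print(sorted(buggy), sorted(fixed))
print(ready_to_promote(49), ready_to_promote(50))
```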

Every principle here was learned the hard way: real bugs, real money at risk, real debugging sessions at 2am on a QuantVPS SSH terminal. I also built a self-improving RAG system that captures these learnings automatically so future Claude sessions don't repeat past mistakes.

What I Didn't Cover Here

The five principles in this post are the foundation. There's a full advanced layer above them, covered in the Claude Code Guide: Advanced Edition: hooks that run Gate 1 automatically after every file write (no manual step), subagents routing cheap tasks to smaller models (44% cost reduction with progressive context loading), agent teams for parallel feature development, checkpointing for safe architectural experiments, and MCP servers giving Claude direct access to the live database for debugging. Each of these requires the foundation to be working first.

The guide includes the complete Polyphemus architecture walkthrough. If you're deploying your own bot, here's my step-by-step VPS setup on DigitalOcean.

If this case study was useful, the best thing you can do is send it to one developer still burning money using Claude Code without a system.

— Chudi

hello@chudi.dev | chudi.dev

Building a Python Trading Bot: What Actually Works in Production

Chudi Nnorukam — Fri, 10 Apr 2026 23:53:08 +0000

Originally published at chudi.dev

System Architecture Overview

When I started building a trading bot, I expected the hard part to be the trading logic. It wasn't. The hard part was building a system that could run continuously for weeks without losing state, crashing silently, or entering impossible positions.

A production trading bot needs five core modules that work together:

Signal Generation - Monitors price feeds and generates trade signals
Position Management - Executes trades, tracks holdings, and prevents overlapping positions
Exit Strategies - Knows when to close positions and takes profits or cuts losses
State Persistence - Survives process crashes, power failures, and restarts
Health Monitoring - Detects stuck orders, orphaned positions, and API failures

Each module can fail independently, so the system needs to handle partial failures gracefully. A signal can fail without crashing position management. An API call can timeout without losing the position state. The monitoring system watches everything and alerts when something goes wrong.

The entire codebase is about 4,000 lines of Python. Claude Code wrote 95% of it, including the most complex parts: the asyncio event loop, the database schema and queries, and the deployment scripts.

The repo is private for now, but I'm planning to open source the core trading logic once it's hardened further.

Signal Generation

Signals are the input to the entire system. My signals come from two sources: Binance price momentum and Polymarket market odds.

Binance provides real-time price data via WebSocket. I watch 5-minute price movements and detect breakouts above 2-sigma bands. When BTC price breaks upward sharply, I look for corresponding prediction markets on Polymarket that are underpriced relative to the momentum.

The signal pipeline is documented in my post on Binance-Polymarket momentum signal generation. The key insight is that prediction market prices lag price momentum by 30-90 seconds, creating a small window to profit from the difference.

The Momentum Window

Here's how the signal generation actually works in code:

async def detect_momentum_breakout(candles: list[dict]) -> float | None:
    """
    Watch 5-min BTC candles and detect 2-sigma breakouts.
    Returns signal strength (0-1) if breakout detected.
    """
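    closes = [c['close'] for c in candles[-20:]]  # Last 20 candles
    if len(closes) < 2:
        return None  # Not enough data to estimate volatility
    mean = sum(closes) / len(closes)
    variance = sum((x - mean) ** 2 for x in closes) / len(closes)
    std_dev = variance ** 0.5
    if std_dev == 0:
        return None  # Flat prices: the z-score is undefined

    current_close = closes[-1]
    z_score = (current_close - mean) / std_dev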

    if z_score > 2.0:  # Upside breakout
        return min(1.0, z_score / 3.0)  # Cap at 3-sigma
    return None

When a signal fires, the system calculates how many standard deviations above the mean the price has moved. A 2-sigma move occurs about 2% of the time randomly, but when combined with Polymarket underpricing, the edge becomes real.

The efficiency gap between crypto spot prices and prediction markets exists because:

Crypto moves on technicals and sentiment (fast)
Prediction markets move on fundamental news cycles (slower)
Retail traders on Polymarket have longer decision latency than algo traders on Binance

This isn't an edge I'll have forever. As more traders build similar systems, the 30-90 second window compresses. But for now, it's consistent enough to trade.

Signals feed into a queue. The position manager processes signals one at a time, ensuring we never accidentally open two positions on the same market.
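One way to get that one-at-a-time guarantee is a single consumer reading from an asyncio.Queue. This is a sketch under assumptions, not the bot's code: the signal dict shape, open_markets, and consume_signals are hypothetical names, and a long-running consumer would await queue.get() in a loop rather than drain and return.

```python
import asyncio


async def consume_signals(queue: asyncio.Queue, open_markets: set) -> list:
    """Drain the queue with one consumer; skip markets that already have a position."""
    opened = []
    while not queue.empty():
        sig = queue.get_nowait()
        market_id = sig["market_id"]
        if market_id not in open_markets:
            open_markets.add(market_id)  # claim the market before placing any order
            opened.append(market_id)
        queue.task_done()
    return opened


async def demo() -> list:
    queue = asyncio.Queue()
    # The duplicate "btc-up" signal is deliberate.
    for market_id in ("btc-up", "eth-up", "btc-up"):
        queue.put_nowait({"market_id": market_id, "strength": 0.8})
    return await consume_signals(queue, open_markets=set())


result = asyncio.run(demo())
print(result)
```

Because only one coroutine ever touches open_markets, there is no window in which two signals for the same market can both pass the check.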

Position Management

Once a signal arrives, the position manager decides: do we take this trade, or skip it?

The decision logic checks:

Do we already have a position in this market? If yes, skip.
Is the order book deep enough to execute at reasonable prices? If no, skip.
Has this market been active for more than 24 hours? If no, skip. Recent markets are illiquid.
Are we at our maximum concurrent positions? If yes, skip.
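These four checks can be sketched as one pure function that returns a decision plus a reason for the log. MarketSnapshot, should_enter, and the default of 3 concurrent positions are assumptions for illustration; the post does not state the real limit.

```python
from dataclasses import dataclass


@dataclass
class MarketSnapshot:
    market_id: str
    age_hours: float    # how long the market has been active
    book_is_deep: bool  # result of a separate liquidity check


def should_enter(market: MarketSnapshot, open_markets: set, max_positions: int = 3):
    """Return (ok, reason), running the checks in the order listed."""
    if market.market_id in open_markets:
        return False, "already have a position"
    if not market.book_is_deep:
        return False, "order book too thin"
    if market.age_hours <= 24:
        return False, "market younger than 24 hours"
    if len(open_markets) >= max_positions:
        return False, "at max concurrent positions"
    return True, "ok"


candidate = MarketSnapshot("btc-up", age_hours=30.0, book_is_deep=True)
print(should_enter(candidate, open_markets={"eth-up"}))
```

Returning the reason alongside the boolean makes skipped signals easy to audit later.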

If all checks pass, the position manager places a limit order on the Polymarket CLOB. The CLOB is Polymarket's central limit order book for derivatives trading. It's lower latency than the REST API but requires understanding the order book structure.

See my detailed post on building the Polymarket trading bot for the specifics of CLOB integration.

The position manager tracks every open position in SQLite. Each position stores:

Market ID and outcome tokens
Entry price and quantity
Timestamp and signal strength
Current mark-to-market value
Status (open, closing, closed)

This database survives process restarts. On startup, the position manager reads the database and reconstructs the exact state it was in before the crash.
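A minimal sketch of that startup recovery with the standard library's sqlite3 module. The table here is a simplified stand-in with assumed column names, and the in-memory database is only for the demo; a real bot would open a file on disk.

```python
import sqlite3


def load_open_positions(conn: sqlite3.Connection) -> list:
    """Rebuild in-memory state from every position that is not fully closed."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT market_id, entry_price, quantity, status "
        "FROM positions WHERE status != 'closed' ORDER BY rowid"
    ).fetchall()
    return [dict(row) for row in rows]


conn = sqlite3.connect(":memory:")  # assumption: demo only, use a file path in practice
conn.execute(
    "CREATE TABLE positions (market_id TEXT, entry_price REAL, quantity INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?, ?)",
    [("btc-up", 0.45, 200, "open"), ("eth-up", 0.30, 100, "closed"), ("sol-up", 0.60, 50, "closing")],
)
conn.commit()

restored = load_open_positions(conn)
print([p["market_id"] for p in restored])
```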

Exit Strategies

The hardest part of algorithmic trading is exits. Most retail traders focus on entries but never work out how to take profits or when to close positions. It's the difference between "cool idea" and "actual profit."

I use three exit types:

Profit target - Close 50% of position at 2% profit, rest at 5% profit
Stop loss - Close entire position if it drops 3% below entry
Time decay - If a position hasn't moved in 4 hours, close it (markets that aren't moving are wasting capital)

The exit manager runs every minute, checks all open positions, and executes exits that meet criteria. It places exit orders as limit orders too, so we get the best available prices.

The Math on Asymmetric Position Sizing

Here's why the split profit target works. Say I enter a position at $0.45 on a binary market with $100 per trade:

Winning trades (55% of the time):

Exit 50% at $0.459 (2% profit on $50): +$1.00
Exit remaining 50% at $0.4725 (5% profit on $50): +$2.50
Total on winner: +$3.50 (3.5% return on $100)

Losing trades (45% of the time):

Stop loss hits at $0.4365 (3% below entry): -$3.00
Total on loser: -$3.00 (-3% return on $100)

Expected value calculation:

EV = (0.55 × $3.50) + (0.45 × -$3.00)
EV = $1.925 + (-$1.35)
EV ≈ +$0.58 per $100 bet

That's positive EV at a 55% win rate, assuming every winner reaches both targets. With these payoffs the break-even win rate is about 46% ($3.00 / $6.50); below that the math flips negative. This is why signal quality matters more than quantity. A 60% win rate turns $0.58 per $100 into $0.90 per $100. Signal strength compounds.
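A short sketch that recomputes the payoffs from the stake and price levels, so the expected value and break-even win rate can be checked for any set of targets. It assumes every winner reaches both targets, every loser hits the stop, and fees and slippage are zero; the function names are illustrative.

```python
def trade_payoffs(stake, entry, target_one, target_two, stop):
    """Dollar P&L for a full winner (both targets hit) and a full loser (stopped out)."""
    shares = stake / entry
    half = shares / 2
    win = half * (target_one - entry) + half * (target_two - entry)
    loss = shares * (entry - stop)
    return win, loss


def expected_value(win_rate, win, loss):
    """Average P&L per trade at a given win rate."""
    return win_rate * win - (1 - win_rate) * loss


def breakeven_win_rate(win, loss):
    """Win rate at which expected value is exactly zero."""
    return loss / (win + loss)


win, loss = trade_payoffs(stake=100, entry=0.45, target_one=0.459,
                          target_two=0.4725, stop=0.4365)
ev_55 = expected_value(0.55, win, loss)
print(f"win={win:.2f} loss={loss:.2f} EV@55%={ev_55:.3f} "
      f"breakeven={breakeven_win_rate(win, loss):.1%}")
```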

The split exit structure protects against reversal. If the market hits 2% and I've already cashed out half, the second half can either hit 5% or get stopped out. Either way, I've locked 50% of the winning outcome. That's risk management.

The asymmetry is deliberate. I lose 3% on losers but capture 2-5% on winners because prices move differently on prediction markets. They don't move in smooth linear paths. They bounce. A tight 1% stop gets triggered by noise. A 3% stop lets the position breathe.

State Tracking for Reliable Exits

The position database tracks exit status for each open position:

CREATE TABLE positions (
    id INTEGER PRIMARY KEY,
    market_id TEXT,
    entry_price REAL,
    entry_qty INTEGER,
    entry_time TEXT,
    stop_loss_price REAL,      -- 0.4365 for 0.45 entry
    target_one_price REAL,     -- 0.459 (2% profit)
    target_two_price REAL,     -- 0.4725 (5% profit)
    exit_status TEXT,          -- 'open', 'half_closed', 'closing', 'closed'
    target_one_filled_at TEXT,
    target_two_filled_at TEXT
);

Every minute, the exit manager reads the current market price and compares it against these thresholds. When target one hits, it updates exit_status to 'half_closed' and records the fill time. This prevents double-exits and ensures the second half of the position doesn't exit prematurely.
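The status transitions can be sketched as a small pure function over one row of the positions table. This is an illustration, not the bot's exit manager: next_exit_action is a hypothetical name, fills are treated as instantaneous, and time-based exits are left out.

```python
def next_exit_action(position: dict, price: float):
    """Return (action, new_status) for one position at the current price."""
    status = position["exit_status"]
    if status in ("closing", "closed"):
        return "none", status  # an exit is already in flight or done
    if price <= position["stop_loss_price"]:
        return "close_all", "closed"
    if status == "open" and price >= position["target_one_price"]:
        return "sell_half", "half_closed"
    if status == "half_closed" and price >= position["target_two_price"]:
        return "sell_rest", "closed"
    return "none", status


position = {
    "exit_status": "open",
    "stop_loss_price": 0.4365,
    "target_one_price": 0.459,
    "target_two_price": 0.4725,
}
action, position["exit_status"] = next_exit_action(position, 0.46)
print(action, position["exit_status"])
```

Calling it again at the same price returns no action, because the status is already 'half_closed'. That is the double-exit protection.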

Limit orders are crucial here. Market orders on Polymarket can slip 0.5-1% depending on order book depth. A limit order at target_one_price of $0.459 sits on the book until filled. If the market only reaches $0.458, the order stays open. If the market trades up through $0.459, it fills at $0.459 or better, never below it, whereas a market order fills at whatever the book offers. Over 100 trades, limit order precision saves 0.3-0.5% in aggregate slippage.

The real lesson: don't set stop losses too tight on prediction markets. Binary outcomes mean prices bounce around more than equity markets. A 1% stop gets triggered by noise. 3% gives the position room to breathe while still protecting against real reversals.

Read my post on betting math for binary markets to understand the probability calculations that feed into exit sizing.

The trickiest exit is time decay. Prediction markets converge to 0% or 100% as the event approaches. If I'm long and the market isn't moving, the time decay works against me. Exiting stale positions frees capital for new signals.

One thing that surprised me: the time decay exit generated more total profit than the profit target exit. Not because individual exits were bigger, but because freeing stale capital meant the bot could take 2-3 more trades per day. Capital velocity matters more than any single trade's P&L.

Paper to Live Transition

I paper-traded for 6 weeks before going live. Paper trading means simulating trades without real money, just tracking P&L in a spreadsheet.

Paper trading revealed two strategy failures:

Liquidity trap - The culprit was insidious: entry prices looked good because I was catching markets in transition, when stale limit orders sat on the books. But exit? Nightmare. On markets with under $50K order book depth, I'd win the entry at $0.45 then get forced out at $0.42 because no one was buying. The tight spread at entry reversed hard at exit. Across 30 paper trades, this cost pattern showed up 7 times. Each time: entry P&L looked profitable (+2-3%), but the exit slippage erased it. One trade: up $2.25 on 50% exit at 2% profit, then the final 50% couldn't execute near target. Ended up closing at $0.41 (-4% from entry) because the order book evaporated. That single trade went from +$3 to -$1. Multiply that by 7 failed exits across the paper period, and I'm looking at roughly $2,000 in prevented losses once I added the liquidity check.

The fix was a simple gate before position entry:

async def check_market_liquidity(market_id: str) -> bool:
    """
    Only enter if order book has sufficient depth.
    Skip if spread > 1% of mid-price or total depth < threshold.
    """
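    order_book = await polymarket.get_order_book(market_id)
    bids, asks = order_book['bids'], order_book['asks']
    if not bids or not asks:  # Empty or one-sided book: nothing to trade against
        return False
    best_bid = float(bids[0]['price'])
    best_ask = float(asks[0]['price'])
    mid = (best_bid + best_ask) / 2
    spread_pct = ((best_ask - best_bid) / mid) * 100

    # Levels are dicts (see bids[0]['price'] above). The 'size' key is an
    # assumption about the response shape; adjust it to match your client.
    total_depth = sum(float(lvl['size']) for lvl in bids[:5]) + \
                  sum(float(lvl['size']) for lvl in asks[:5])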

    if spread_pct > 1.0:  # Spread too wide
        return False
    if total_depth < 500:  # Not enough size to exit
        return False
    return True

This single filter would have prevented 7 bad trades and saved ~$2,000 in real losses.

Signal timing - The second failure was time-dependent. Signals arriving during European market hours (roughly 2-8 AM UTC, when US traders sleep) got filled at punishment prices. Same signal, same market, but the order book was thin and slow-moving. I'd get a signal on BTC momentum at 5 AM UTC. By the time I placed the order, retail traders on Polymarket hadn't woken up yet. Bid-ask spread was 0.5-1% instead of 0.2%. Fills were 50-100 basis points worse than signals arriving during US peak hours (12pm-11pm UTC). Over the 30 paper trades, only 6 arrived during European dead hours, but those 6 had 0.5-1% worse fills than identical signals during US hours. That's roughly $1,500 in prevented slippage if I'd filtered those out.

The time-of-day filter was even simpler:

import datetime

async def is_peak_trading_hour() -> bool:
    """
    Only process signals during peak US trading hours.
    12pm-11pm UTC captures most US market activity.
    """
    # Timezone-aware now(); datetime.utcnow() is deprecated since Python 3.12
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    current_hour = now_utc.hour

    # Peak hours: 12pm-11pm UTC (7am-6pm EST)
    return 12 <= current_hour < 23

async def should_process_signal(signal: dict) -> bool:
    if not await is_peak_trading_hour():
        return False  # Skip this signal
    return True

That's it. One hour check prevented $1,500 in unnecessary slippage by avoiding markets where the book moves like molasses.

Both failures would have cost real money live. The 6-week cost in time was worth it.

I ran at least 30 paper trades before going live. That's the minimum to see the major failure modes. Anything less and you're just guessing.

Sources

Polymarket CLOB API Documentation (Polymarket)
How I Built a 4,000-Line Production Trading Bot With Claude Code (chudi.dev)

Deploy Python Agents on DigitalOcean for $6/Month

Chudi Nnorukam — Fri, 10 Apr 2026 23:52:14 +0000

Originally published at chudi.dev

Disclosure: This post contains affiliate links to DigitalOcean. If you sign up through these links, I earn a commission at no extra cost to you, and you get $200 in free credits for 60 days. I ran my trading bot on a DigitalOcean Droplet before migrating to a specialized VPS for lower latency. I recommend DO for Python agents because I used it and it worked.

My Python trading bot worked perfectly on my laptop. Asyncio event loop, WebSocket connections to Binance, real-time order placement on Polymarket. Clean code. Passed review. Ran great locally.

Then I tried to deploy it.

The first attempt was AWS Lambda. Cold starts added 400ms to every signal. The 15-minute timeout killed my long-running WebSocket connections. I spent two days fighting CloudWatch logs before I realized: Lambda was built for HTTP request handlers, not for a process that needs to stay alive.

Here's what I wish someone had told me: deploying a Python agent that runs 24/7 costs $6/month and takes 30 minutes. Not $50. Not $80. Six dollars.

I ran my Polymarket trading bot on a $6 DigitalOcean Droplet for months. It processed live Binance price feeds, placed orders on the CLOB, and managed exits autonomously. 69.6% win rate across 23 clean trades. The infrastructure never failed me. The server was not the bottleneck. It never was.

This is the setup guide that would have saved me those two days on Lambda.

What does a Python agent actually need?

Most developers pick their deployment platform based on what they already know. If you've used AWS before, you reach for EC2 or Lambda. If you're a Heroku person, you spin up a dyno. Nobody stops to ask: what are the actual infrastructure requirements?

A long-running Python agent requires:

A process that stays alive (not a function that runs and dies)
Persistent connections (WebSocket feeds, database connections)
Predictable cost (not pay-per-invocation that spikes when your bot gets active)
Full control over the runtime (Python version, system packages, cron)

Here's what it does not need:

Auto-scaling (your bot is one process)
Load balancers (it's not serving HTTP traffic)
Managed runtimes (you want control, not guardrails)
A $50/month bill for resources you'll never use

That last point stings. If you're running a single Python agent on Heroku ($7-25/mo), Railway ($5 + metered usage), or an EC2 instance you forgot to right-size ($15-80/mo depending on how lost you got in the AWS console), you're paying for infrastructure designed for problems you don't have.

How do DigitalOcean costs compare to AWS, Heroku, and Railway?

I evaluated four providers before my first deploy. Here's the honest comparison:

Provider	Monthly Cost	Best For	The Catch
AWS EC2	$8-80/mo	You already live in AWS	Console is a maze. Surprise bills are real. A t3.micro costs $8, but you'll add EBS, bandwidth, and by month 3 you're at $35 wondering what happened.
Heroku	$7-25/mo	Quick web app deploys	Dynos sleep on the free tier. Paid tier starts at $7 but a worker dyno for a bot is $25. No SSH access. Limited debugging.
Railway	$5 + usage	Git-push simplicity	Usage-based pricing sounds cheap until your bot runs 24/7. A busy month can cost $20-40.
DigitalOcean	$6/mo flat	Bots, agents, anything that runs 24/7	Fewer regions than AWS. You manage your own server. That's it.

I picked DigitalOcean. $6/month flat. No metered surprises. Full root access. The documentation read like it was written by someone who actually uses the product. I went from zero to running bot in 28 minutes.

Here's what that $6 gets you: 1 vCPU, 1GB RAM, 25GB SSD, 1TB transfer. My trading bot (asyncio event loop, multiple WebSocket connections, real-time order placement) used about 200MB of that RAM. The CPU barely touched 5% between trading signals.

Most of you are overpaying. Let me show you the setup.

What do you need before starting?

You need three things: Python 3.10 or higher on your local machine, an SSH key pair (generate one with ssh-keygen -t ed25519 if you don't have one), and 30 minutes of uninterrupted time.

That's literally all the prerequisites.

Step 1: Create a Droplet (2 Minutes)

Log into DigitalOcean and create a new Droplet:

Region: Pick the closest to your data source. For my trading bot, I chose Amsterdam (5-12ms to Polymarket's CLOB in London). For a general agent, pick the region nearest whatever API you call most.
Image: Ubuntu 24.04 LTS
Size: Basic, Regular, $6/month (1 vCPU, 1GB RAM, 25GB SSD)
Authentication: SSH Key (paste your public key from ~/.ssh/id_ed25519.pub)

Never choose password authentication. SSH keys only. This is non-negotiable. I'll explain why in Step 2.

The Droplet spins up in about 60 seconds. Grab the IP address from the dashboard. Test your connection:

ssh root@YOUR_DROPLET_IP

That root prompt means you have a server. Running. Waiting for your code. For $6/month, you now own a machine that will run 24/7 whether or not you're watching.

Step 2: Lock It Down (10 Minutes)

I skipped this step the first time. A week later, I checked the auth log and found 3,000 brute-force SSH attempts from IPs I'd never seen. Nothing was compromised (SSH keys are strong), but it was a wake-up call.

Ten minutes of security setup prevents real problems:

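# Create a deploy user (never run your bot as root)
adduser --disabled-password --gecos "" agent
usermod -aG sudo agent

# The agent user has no password, so sudo would have nothing to prompt for.
# Allow passwordless sudo (or set a password with `passwd agent` instead).
echo 'agent ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/agent
chmod 440 /etc/sudoers.d/agent

# Copy your SSH key to the new user
mkdir -p /home/agent/.ssh
cp /root/.ssh/authorized_keys /home/agent/.ssh/
chown -R agent:agent /home/agent/.ssh
chmod 700 /home/agent/.ssh
chmod 600 /home/agent/.ssh/authorized_keys

# Disable root login and password auth
# (the patterns match whether or not the line is commented out)
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart ssh    # the unit is named "ssh" on Ubuntu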

# Enable the firewall
ufw allow OpenSSH
ufw --force enable

Log out. Reconnect as the agent user:

ssh agent@YOUR_DROPLET_IP

You now have a locked-down server. Root login disabled, password auth disabled, firewall active. This is what separates a production server from a tutorial project.

Step 3: Set Up Python (3 Minutes)

Ubuntu 24.04 ships with Python 3.12. Set up a virtual environment:

sudo apt update && sudo apt install -y python3-venv python3-pip

mkdir -p ~/my-agent
cd ~/my-agent

python3 -m venv venv
source venv/bin/activate

pip install aiohttp websockets python-dotenv

Always use a venv. I learned this the hard way when a system-level pip install broke apt's Python dependencies on my first Droplet. Took me an hour to untangle. A venv takes 10 seconds to create and prevents that entirely.

Step 4: Deploy Your Agent (5 Minutes)

Here's a production-ready async agent skeleton. This is the same pattern my trading bot used. Replace the inner logic with whatever your agent does:

# ~/my-agent/agent.py
import asyncio
import signal
import logging
from datetime import datetime, timezone

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("agent.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

shutdown_event = asyncio.Event()

def handle_signal(sig, frame):
    logger.info(f"Received signal {sig}, shutting down gracefully...")
    shutdown_event.set()

async def process_tick(data: dict):
    """Your agent logic goes here."""
    logger.info(f"Processing: {data}")

async def run_agent():
    """Main agent loop. Replace with your real logic."""
    logger.info("Agent started")

    cycle = 0
    while not shutdown_event.is_set():
        try:
            # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
            await process_tick({"cycle": cycle, "ts": datetime.now(timezone.utc).isoformat()})
            cycle += 1
            await asyncio.sleep(10)

        except Exception as e:
            logger.error(f"Agent error: {e}")
            await asyncio.sleep(5)

    logger.info("Agent stopped cleanly")

def main():
    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)
    asyncio.run(run_agent())

if __name__ == "__main__":
    main()

Create a .env file for configuration:

# ~/my-agent/.env
API_KEY=your_api_key_here
POLL_INTERVAL=10
LOG_LEVEL=INFO
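The skeleton above does not read these values yet. One way to wire them in is a small loader over os.environ, sketched here with hypothetical names (AgentConfig, load_config). It assumes the variables are already in the process environment, for example via systemd's EnvironmentFile or python-dotenv's load_dotenv().

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    api_key: str
    poll_interval: float
    log_level: str


def load_config(env=None) -> AgentConfig:
    """Read settings from the environment, failing fast if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise RuntimeError("API_KEY is not set")
    return AgentConfig(
        api_key=api_key,
        poll_interval=float(env.get("POLL_INTERVAL", "10")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Passing a dict keeps the demo independent of the real environment.
config = load_config({"API_KEY": "your_api_key_here", "POLL_INTERVAL": "10", "LOG_LEVEL": "info"})
print(config.poll_interval, config.log_level)
```

Failing at startup on a missing key is deliberate: a bot that starts without credentials tends to fail later, in the middle of something that matters.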

Test it on the Droplet:

cd ~/my-agent
source venv/bin/activate
python3 agent.py

You should see log output. Press Ctrl+C to stop. If it runs for 30 seconds clean, you're ready for the part most tutorials skip.

Step 5: How do you keep a Python agent running 24/7 on a Droplet?

Use systemd. This is the critical step separating "I deployed my bot" from "my bot runs in production."

Running your agent in screen or tmux survives an SSH disconnect, but not a server reboot, and nothing restarts it when your code crashes at 3am with nobody watching. I lost 6 hours of trading signals before learning about systemd.

systemd handles three essential tasks:

Automatically restarts your agent if it crashes
Starts it on server reboot without manual intervention
Manages logs for debugging and audit trails

Create a service file:

sudo tee /etc/systemd/system/my-agent.service << 'EOF'
[Unit]
Description=My Python Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=agent
WorkingDirectory=/home/agent/my-agent
Environment=PATH=/home/agent/my-agent/venv/bin:/usr/bin
EnvironmentFile=/home/agent/my-agent/.env
ExecStart=/home/agent/my-agent/venv/bin/python3 agent.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable my-agent
sudo systemctl start my-agent

Check it:

sudo systemctl status my-agent

You should see active (running). Your agent now survives reboots, crashes, and SSH disconnects. Close your laptop. Go to sleep. It keeps running.

Step 6: What are the common mistakes after deployment?

I hit all of these. You'll hit at least two.

1. "ModuleNotFoundError" after deploy

systemd runs its own environment. If ExecStart points to python3 instead of /home/agent/my-agent/venv/bin/python3, it uses the system Python which doesn't have your packages. Exact path. Every time.

2. Agent dies silently after 6 hours

An unhandled exception in the main loop. Without the try/except wrapper, the agent crashes, systemd restarts it, it crashes again on the same data, and you hit the restart rate limit. The backoff sleep is not optional.

3. Disk fills up from logs

journalctl handles systemd logs, but your agent.log file grows without bounds. I discovered this after a 2GB log file ate my 25GB disk. Add log rotation:

sudo tee /etc/logrotate.d/my-agent << 'EOF'
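# copytruncate: the agent keeps agent.log open, so truncate in place instead of renaming
/home/agent/my-agent/agent.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}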
EOF
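An alternative to logrotate is to rotate inside the process with the standard library's RotatingFileHandler, which caps the file by size rather than by day. This is a sketch: build_logger is a hypothetical helper, and the 10 MB / 7 backup defaults are assumptions. The demo uses a tiny limit and a temp directory so the rotation is visible.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(path: str, max_bytes: int = 10_000_000, backups: int = 7) -> logging.Logger:
    """Logger whose file rolls over at max_bytes and keeps `backups` old files."""
    logger = logging.getLogger("agent")
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "agent.log")
logger = build_logger(log_path, max_bytes=200, backups=2)  # tiny limit for the demo
for i in range(50):
    logger.info("tick %d", i)

files = sorted(os.listdir(log_dir))
print(files)
```

With this handler in place of the plain FileHandler, disk usage is bounded at roughly max_bytes × (backups + 1) no matter how long the agent runs.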

4. SSH disconnect kills the agent

Only happens if you started with python3 agent.py & in a shell session. systemd doesn't have this problem. If you're still running bots in tmux: stop. Use systemd.

How do you deploy code changes to your running agent?

When you update your agent code:

# From your local machine
scp agent.py agent@YOUR_DROPLET_IP:~/my-agent/

# Restart the service
ssh agent@YOUR_DROPLET_IP "sudo systemctl restart my-agent"

# Verify clean start
ssh agent@YOUR_DROPLET_IP "sudo journalctl -u my-agent -n 10 --no-pager"

Three commands. Under 10 seconds. I started with scp and switched to git-based deploys after the third time I forgot to push a dependency file. For a single agent, scp is fine.

What should you verify before going live?

Before you trust your agent with real work or money:

[ ] SSH key auth only (password auth disabled)
[ ] Firewall active (ufw status shows SSH allowed, everything else denied)
[ ] Non-root user running the agent
[ ] systemd service with Restart=always
[ ] Error handling in the main loop (catch, log, backoff, continue)
[ ] Signal handlers for SIGTERM/SIGINT
[ ] Log rotation configured
[ ] Monitoring (check logs daily until stable)

If your agent handles money, also add: position limits, stop-losses, health check endpoints, and alert notifications. I documented the full trading bot architecture, risk management, and signal detection patterns in my Polymarket trading bot guide.

When do you outgrow a $6 Droplet?

I migrated away from DigitalOcean when my trading strategy demanded sub-10ms round-trip latency to Polymarket's CLOB. The Amsterdam region provided 5-12ms, which worked for my initial strategy. When I needed 3-5ms consistency, I switched to a specialized VPS provider colocated near the exchange.

That migration took 4 months. It only mattered because I was doing latency arbitrage where 100ms cost me money.

For most Python agents, you will not outgrow a $6 Droplet for years.

You'll know it's time to upgrade when:

Milliseconds matter: Your agent's profitability depends on execution speed
Multiple agents: You're running 4+ processes and hitting CPU/RAM limits
GPU inference: Your agent runs ML models that need a GPU
Compliance: You need certifications or regions DO doesn't offer

Until then, every month you spend more than $6 on infrastructure for a single Python agent is money you didn't need to spend.

Start Here

If you've been running your bot on Lambda hitting execution timeouts, paying $25/month for a Heroku worker dyno, or avoiding deployment because AWS console complexity paralyzes you:

Sign up for DigitalOcean (claim $200 in free credits for 60 days, enough to test this entire setup cost-free)
Follow the 6-step process above (30 minutes total, no dependencies)
Replace the example agent skeleton with your actual logic
Run through the production checklist and monitor for 48 hours

Result: $6/month. 30 minutes setup time. Your agent running 24/7 on infrastructure you control completely.

For trading bot builders: Start with my guide on building a Polymarket trading bot to understand signal detection, order placement, risk management, and profitability metrics. Then return here for the deployment infrastructure. My bot achieved 69.6% win rate on this exact setup before migrating for latency reasons.

For general AI agents: This Droplet setup works equally well for data crawlers, scheduled scrapers, webhook processors, LLM orchestration systems, and autonomous agents. The pattern applies to any Python process that needs 24/7 uptime without paying cloud premium prices.

What are you deploying? I'm curious what agents people are running on VPS these days.

Sources

DigitalOcean Droplet Documentation (DigitalOcean)
systemd Service Units (freedesktop.org)
Python asyncio Documentation (Python Software Foundation)
DigitalOcean SSH Key Setup (DigitalOcean)

Claude Code Hooks Tutorial: 4 Production Patterns for Code Guardrails

Chudi Nnorukam — Fri, 10 Apr 2026 23:51:56 +0000

Originally published at chudi.dev

I ran a secret scanner on every project for months before I realized Claude Code was writing .env files with real credentials baked in. Not because it was malicious. Just because the context had a key, and it needed a value.

The fix took five minutes once I knew hooks existed.

Claude Code hooks let you run any shell command automatically when tool events fire. Before a file gets written, after a bash command runs, when Claude finishes a task. You get full context about what's happening via stdin, and for PreToolUse hooks, you can block the operation entirely.

This is the guide I wish I had when I started.

What Are Claude Code Hooks and Why Do You Need Them?

Claude Code is an autonomous agent. It reads files, writes code, runs commands, and makes decisions faster than you can review each one. That autonomy is the point. But it creates a gap: how do you enforce standards without reviewing every action manually?

Hooks close that gap. They're your enforcement layer — running in the background, checking every operation against your rules, and either approving it or blocking it before any damage is done.

Think of them as middleware for your AI agent. The tool fires an event, your hook intercepts it, does its check, and returns a decision. If the hook returns {"continue": false}, Claude stops. If it returns {"continue": true} (or nothing), Claude proceeds.

What Are the Four Claude Code Hook Events?

Claude Code exposes four events you can hook into (see official hooks docs):

PreToolUse — fires before any tool runs. You can inspect the tool input and block execution. This is where guardrails live.

PostToolUse — fires after a tool completes. You get the tool output. Use this for logging, formatting, or triggering follow-on actions.

Notification — fires when Claude sends a notification (waiting for input, task complete, etc.). Good for custom alerts.

Stop — fires when the agent finishes a task. Use this for cleanup, summaries, or Slack notifications.

There's also SubagentStop which fires when a subagent finishes, if you're running parallel agents.

Where Do You Configure Claude Code Hooks?

Everything goes in ~/.claude/settings.json. The structure looks like this:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/scripts/scan-secrets.sh"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/scripts/auto-format.sh"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/scripts/notify-complete.sh"
          }
        ]
      }
    ]
  }
}

The matcher field is a regex matched against the tool name. Write|Edit matches both the Write tool and the Edit tool. Leave it out to match all tools for that event.

What your hook receives

Every hook gets a JSON object via stdin. For a PreToolUse hook on the Write tool, it looks like this:

{
  "session_id": "abc123",
  "hook_event_name": "PreToolUse",
  "tool_name": "Write",
  "tool_input": {
    "file_path": "/Users/you/project/.env",
    "content": "STRIPE_SECRET_KEY=sk_live_abc123..."
  }
}

For a Bash tool, tool_input contains command instead of file_path. For Edit, you get file_path, old_string, and new_string. The shape matches the tool's schema.

PostToolUse hooks also get tool_response — the actual output the tool returned.
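A minimal Python sketch of reading that payload, assuming the field names shown above (tool_input, content, new_string, command). The helper name text_to_inspect is hypothetical and not part of Claude Code.

```python
import json


def text_to_inspect(raw: str) -> str:
    """Return the text a guardrail would want to scan for one hook payload."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ""  # malformed input: nothing to inspect
    if not isinstance(payload, dict):
        return ""
    tool_input = payload.get("tool_input") or {}
    # Write carries "content", Edit carries "new_string", Bash carries "command".
    parts = [tool_input.get(key, "") for key in ("content", "new_string", "command")]
    return "\n".join(p for p in parts if isinstance(p, str) and p)


# Sample payload shaped like the PreToolUse example above (values are placeholders).
sample = json.dumps({
    "hook_event_name": "PreToolUse",
    "tool_name": "Write",
    "tool_input": {"file_path": "/tmp/.env", "content": "API_KEY=placeholder"},
})
print(text_to_inspect(sample))
```

Returning an empty string on bad input keeps the hook from crashing on a payload shape it does not recognize.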

What your hook must return

This is the part that trips people up.

If your hook writes anything to stdout, it must be valid JSON. The Claude Code protocol reads stdout as structured data. If you print plain text, you'll get protocol errors.

#!/bin/bash
# WRONG - plain text on stdout breaks the protocol
echo "Scanning for secrets..."
echo '{"continue": true}'

# RIGHT - silence stderr, and write nothing but JSON to stdout
exec 2>/dev/null
echo '{"continue": true}'

The valid return fields are:

{
  "continue": true,
  "suppressOutput": false,
  "decision": "approve",
  "reason": "No secrets found"
}

continue: false blocks the tool. suppressOutput: true hides the hook output from Claude's context. reason gets shown in the UI when you block.

If your script exits with code 0 and returns nothing, Claude proceeds. Per the official hooks docs, exit code 2 is the blocking error. Other non-zero exit codes are reported as non-blocking errors, and Claude carries on.
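Hand-building JSON with echo gets fragile once the reason contains quotes. Here is a small Python sketch that serializes a decision using only the continue and reason fields listed above. The function name hook_response is an assumption for illustration.

```python
import json


def hook_response(allow: bool, reason: str = "") -> str:
    """Serialize a hook decision as one line of JSON."""
    body = {"continue": allow}
    if reason:
        body["reason"] = reason
    # json.dumps escapes quotes and backslashes inside the reason.
    return json.dumps(body)


print(hook_response(True))
print(hook_response(False, 'Blocked: found "sk_live_" prefix'))
```

A hook written in Python would print this string as its only stdout output and then exit with code 0.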

Hook 1: Secret scanner

This is the one I wish I'd had from day one. It runs before any Write or Edit and blocks the operation if it finds credentials.

#!/bin/bash
# ~/.claude/scripts/scan-secrets.sh

exec 2>/dev/null

INPUT=$(cat)
CONTENT=$(echo "$INPUT" | python3 -c "
import json, sys
d = json.load(sys.stdin)
ti = d.get('tool_input', {})
print(ti.get('content', '') + ti.get('new_string', ''))
" 2>/dev/null || echo "")

# Check for common secret patterns
PATTERNS=(
  'sk_live_[A-Za-z0-9]+'
  'xoxb-[A-Za-z0-9-]+'
  'AKIA[A-Z0-9]{16}'
  'ghp_[A-Za-z0-9]{36}'
  'rpa_[A-Za-z0-9]+'
  'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9'
)

for pattern in "${PATTERNS[@]}"; do
  if echo "$CONTENT" | grep -qE "$pattern"; then
    echo "{\"continue\": false, \"reason\": \"Blocked: potential secret detected matching pattern $pattern\"}"
    exit 0
  fi
done

echo '{"continue": true}'

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit|NotebookEdit",
        "hooks": [{ "type": "command", "command": "/Users/you/.claude/scripts/scan-secrets.sh" }]
      }
    ]
  }
}

Now every file write goes through the scanner. If it finds a Stripe live key, Slack token, or AWS key, it blocks with a reason Claude can read and explain.
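The matching step on its own, as a Python sketch. It reuses four of the patterns from the shell script above. The function name first_secret_pattern is hypothetical.

```python
import re

# A subset of the patterns from scan-secrets.sh, compiled once up front.
SECRET_PATTERNS = [
    r"sk_live_[A-Za-z0-9]+",   # Stripe live key
    r"xoxb-[A-Za-z0-9-]+",     # Slack bot token
    r"AKIA[A-Z0-9]{16}",       # AWS access key ID
    r"ghp_[A-Za-z0-9]{36}",    # GitHub personal access token
]
COMPILED = [re.compile(p) for p in SECRET_PATTERNS]


def first_secret_pattern(text: str):
    """Return the first pattern that matches text, or None if it looks clean."""
    for pattern in COMPILED:
        if pattern.search(text):
            return pattern.pattern
    return None


print(first_secret_pattern("STRIPE_SECRET_KEY=sk_live_abc123"))
print(first_secret_pattern("DEBUG=true"))
```

Returning the pattern rather than the matched text keeps the secret itself out of the block reason.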

Hook 2: Auto-formatter

After Claude edits a TypeScript or Python file, run the formatter automatically. No more "Claude wrote valid code but wrong indentation."

#!/bin/bash
# ~/.claude/scripts/auto-format.sh

exec 2>/dev/null

INPUT=$(cat)
FILE=$(echo "$INPUT" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d.get('tool_input', {}).get('file_path', ''))
" 2>/dev/null || echo "")

if [[ -z "$FILE" ]]; then
  echo '{"continue": true}'
  exit 0
fi

case "$FILE" in
  *.ts|*.tsx)
    command -v prettier &>/dev/null && prettier --write "$FILE" &>/dev/null
    ;;
  *.py)
    command -v ruff &>/dev/null && ruff format "$FILE" &>/dev/null
    ;;
esac

echo '{"continue": true}'

PostToolUse on Edit and Write:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "/Users/you/.claude/scripts/auto-format.sh" }]
      }
    ]
  }
}

The formatter runs silently after every edit. Claude's next read of the file sees clean, formatted code without any back-and-forth.

Hook 3: Slack notification on task complete

I work with Claude running in the background while I do other things. The Stop hook lets me know when it's done without watching the terminal.

#!/bin/bash
# ~/.claude/scripts/notify-complete.sh

exec 2>/dev/null

TOKEN="${SLACK_BOT_TOKEN:-}"
CHANNEL="${SLACK_NOTIFY_CHANNEL:-}"

if [[ -z "$TOKEN" || -z "$CHANNEL" ]]; then
  echo '{"continue": true}'
  exit 0
fi

INPUT=$(cat)
SESSION=$(echo "$INPUT" | python3 -c "
import json, sys
print(json.load(sys.stdin).get('session_id', 'unknown'))
" 2>/dev/null || echo "unknown")

curl -sf -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"channel\":\"$CHANNEL\",\"text\":\":white_check_mark: Claude finished task (session: $SESSION)\"}" \
  > /dev/null

echo '{"continue": true}'

{
  "hooks": {
    "Stop": [
      {
        "hooks": [{ "type": "command", "command": "/Users/you/.claude/scripts/notify-complete.sh" }]
      }
    ]
  }
}

Now when a long refactor finishes, my phone buzzes.

Hook 4: Approval gate for destructive bash commands

This one requires more care. Some bash commands are irreversible — dropping databases, deleting branches, modifying production configs. The PreToolUse hook on Bash lets you intercept these.

#!/bin/bash
# ~/.claude/scripts/approve-destructive.sh

exec 2>/dev/null

INPUT=$(cat)
CMD=$(echo "$INPUT" | python3 -c "
import json, sys
print(json.load(sys.stdin).get('tool_input', {}).get('command', ''))
" 2>/dev/null || echo "")

DESTRUCTIVE_PATTERNS=(
  'rm -rf'
  'drop table'
  'DROP TABLE'
  'git push --force'
  'git reset --hard'
  'kubectl delete'
  'systemctl stop'
)

for pattern in "${DESTRUCTIVE_PATTERNS[@]}"; do
  if echo "$CMD" | grep -qF "$pattern"; then
    echo "{\"continue\": false, \"reason\": \"Blocked: '$pattern' requires explicit approval. Run the command manually if intended.\"}"
    exit 0
  fi
done

echo '{"continue": true}'

This doesn't ask for approval interactively — that would deadlock. Instead it blocks and explains. You review the command, run it yourself if it's correct, and Claude continues from there.
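The shell version matches fixed strings case-sensitively, so "Drop Table" or a doubled space slips past it. A Python sketch of a looser check, assuming you want case-insensitive matching with whitespace collapsed. That is a variation on the script above, not what it does today, and destructive_match is a hypothetical name.

```python
import re

# Lowercased phrases; the command is normalized before comparison.
DESTRUCTIVE = [
    "rm -rf",
    "drop table",
    "git push --force",
    "git reset --hard",
    "kubectl delete",
    "systemctl stop",
]


def destructive_match(command: str):
    """Return the first destructive phrase found in command, else None."""
    normalized = re.sub(r"\s+", " ", command).lower()
    for phrase in DESTRUCTIVE:
        if phrase in normalized:
            return phrase
    return None


print(destructive_match("psql -c 'Drop   Table users'"))
print(destructive_match("ls -la"))
```

Substring matching still produces false positives, for example a commit message that mentions "rm -rf". For a block-and-explain gate that is usually an acceptable trade.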

Gotchas that cost me time

Shell profile output breaks hooks. If your .zshrc or .bashrc prints anything (greeting messages, nvm output, conda activation), it can pollute the hook's stdout. Silence it at the source. exec 2>/dev/null at the top of a hook script only covers stderr, so it will not hide anything a profile prints to stdout.

Hooks run in a non-interactive shell. Your PATH, aliases, and shell functions aren't loaded. Use full absolute paths to commands (/opt/homebrew/bin/prettier, not prettier).

PreToolUse latency adds up. If your hook takes 500ms and Claude runs 50 Edit operations, that's 25 extra seconds. Profile your hooks. Secret scanning should be under 50ms. If it's slow, check for regex backtracking.
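The arithmetic, plus a rough way to time the regex step, as a Python sketch. Both function names are hypothetical. The timer measures only the pattern search, not process startup, which for a shell hook that spawns python3 is likely the larger cost.

```python
import re
import time


def total_overhead_seconds(per_call_ms: float, calls: int) -> float:
    """Total added wall time for a hook that runs once per tool call."""
    return per_call_ms * calls / 1000.0


def time_scan_ms(pattern: str, text: str, repeats: int = 100) -> float:
    """Average milliseconds for one regex search over text."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    for _ in range(repeats):
        compiled.search(text)
    return (time.perf_counter() - start) * 1000.0 / repeats


print(total_overhead_seconds(500, 50))  # 25.0
print(time_scan_ms(r"sk_live_[A-Za-z0-9]+", "x" * 10_000))
```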

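The matcher is a regex, not a glob. Write|Edit works. Write* does not mean what it would in a glob, because the * applies only to the final e. For a prefix match, write Write.* instead.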

Empty stdout is fine, but non-JSON stdout breaks things. Add exec 2>/dev/null to redirect stderr, then only ever echo valid JSON.

My actual settings.json hooks section

This is what I run across all projects:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit|NotebookEdit",
        "hooks": [{ "type": "command", "command": "/Users/chudinnorukam/.claude/scripts/scan-secrets.sh" }]
      },
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "/Users/chudinnorukam/.claude/scripts/approve-destructive.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "/Users/chudinnorukam/.claude/scripts/auto-format.sh" }]
      }
    ],
    "Stop": [
      {
        "hooks": [{ "type": "command", "command": "/Users/chudinnorukam/.claude/scripts/notify-complete.sh" }]
      }
    ]
  }
}

Four hooks. They cover the 90% case: secrets, destructive commands, formatting, and completion notifications. Everything else I handle manually because the hook overhead isn't worth it for low-frequency events.

Where to go from here

Hooks are composable. You can chain multiple hooks on the same event. You can use them to log every tool call to a file for auditing. You can build approval workflows that post to Slack and wait for a reply before proceeding.

The pattern I'm building toward: a full audit log of every Claude action, with replay capability. Every Write, Edit, and Bash call gets logged with the session ID, timestamp, and tool input. When something goes wrong, I can reconstruct exactly what happened and in what order.
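A rough sketch of what one audit record could look like, written as JSON Lines in Python. The record layout, function names, and log path are all assumptions for illustration, not the design from the upcoming post.

```python
import json
import time
from typing import Optional


def audit_line(payload: dict, now: Optional[float] = None) -> str:
    """Build one JSON Lines record from a hook payload."""
    record = {
        "timestamp": time.time() if now is None else now,
        "session_id": payload.get("session_id", "unknown"),
        "tool_name": payload.get("tool_name", ""),
        "tool_input": payload.get("tool_input", {}),
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, payload: dict) -> None:
    """Append one record to the log; one line per tool call preserves order."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(audit_line(payload) + "\n")


example = {"session_id": "abc123", "tool_name": "Bash", "tool_input": {"command": "ls"}}
print(audit_line(example, now=0.0))
```

Because the log stores tool_input verbatim, it can itself end up holding secrets. Treat the file with the same care as the credentials it might capture.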

That's the next post. For now, start with the secret scanner. It's the one hook that pays for itself the first time it catches something.

If you're using Claude Code for real projects, you already know the trust issue. You can't review every edit. Hooks are how you stop trusting blindly and start trusting with guardrails.

Why DA Is Irrelevant for AI Citations (Data from 7 Site Audits)

Chudi Nnorukam — Fri, 10 Apr 2026 23:51:25 +0000

Originally published at chudi.dev

Domain authority does not predict AI citations. Ahrefs has DA 92 and gets cited by AI platforms only 5% of the time. citability.dev launched with DA under 10 and achieved a 15% citation rate on day one. That 3x gap, between the most authoritative domain and a brand-new site with almost no backlinks, is not noise. It is the clearest possible signal that AI source selection runs on completely different rules than Google rankings.

I ran AI Visibility Readiness audits on 7 websites and tested each against ChatGPT, Perplexity, and Claude. The finding is consistent: DA has zero predictive value for whether AI will cite your URL. What predicts citations is content structure, freshness signals, and original data that AI cannot source elsewhere.

What Does Domain Authority Actually Measure?

Domain authority is a Moz metric that scores your backlink profile on a 1-to-100 logarithmic scale. More high-quality sites linking to you means a higher DA. Google uses backlink graphs as one major ranking signal, so DA became a widely used proxy for "how authoritative is this site?"

The problem is the assumption buried in that proxy: that what works for Google works for AI. It does not.

Google PageRank is a graph algorithm. Trustworthiness flows through backlink networks. A site vouched for by high-authority domains earns authority itself.

AI answer engines do not use link graphs at all. They select sources based on whether the content is extractable, verifiable, and attributable. None of those three properties have anything to do with who links to you.

Backlinks are social proof for a graph algorithm. AI needs structured, dated, original content. These are completely different inputs to completely different systems.

What Does Our Benchmark Data Show?

The table below shows DA scores, infrastructure readiness results from the AI Visibility Readiness Framework, and measured citation rates across ChatGPT, Perplexity, and Claude.

Site	DA	Infrastructure	AI Visible	AI Cited
reddit.com	97	Not ready	Untested	Untested
x.com	96	Not ready	Untested	Untested
medium.com	95	Not ready	Untested	Untested
ahrefs.com	92	Foundation-ready	100%	5%
semrush.com	91	Foundation-ready	Partial	Partial
chudi.dev	28	Foundation-strong	25%	0%
citability.dev	under 10	Foundation-strong	44%	15%

Sort by DA. No pattern emerges. The three highest-DA sites in the dataset failed infrastructure readiness entirely. The lowest-DA site has the highest citation rate.

citability.dev vs. chudi.dev is the most instructive comparison. chudi.dev has DA 28 with years of content and backlinks, yet 0% citation rate. citability.dev has DA under 10 and launched with a focused content structure and original benchmark data. The newer, lower-authority site outperformed on citations because it was built for AI extraction from the start.

Reddit, X, and Medium fail infrastructure checks for similar reasons. Reddit blocks AI crawlers in robots.txt. X serves content through JavaScript that most AI crawlers cannot execute. Medium routes content through a platform domain rather than author domains, fragmenting citation attribution. These are not problems backlinks can fix.
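The robots.txt part of that check can be run offline with Python's standard library. This is a sketch under assumptions: the sample file is invented for illustration and is not Reddit's actual robots.txt. GPTBot is used as an example AI crawler user agent.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented example: block one AI crawler, allow everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(crawler_allowed(SAMPLE, "GPTBot", "https://example.com/post"))
print(crawler_allowed(SAMPLE, "SomeOtherBot", "https://example.com/post"))
```

To audit a live site, fetch its robots.txt first and pass the text in. Keeping the fetch separate makes the check easy to test.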

How Do AI Platforms Select Sources?

There are two pathways through which AI cites a URL, and only one of them is influenced by your content decisions.

The first pathway is training data. AI models internalize billions of pages during training. Ahrefs is in that training data at massive scale. When you ask ChatGPT about SEO tools, it knows Ahrefs without fetching anything. That is why Ahrefs is 100% visible despite low citation rates: the AI already knows everything it needs to know about them. Training data visibility does not require infrastructure. It requires being large and old.

The second pathway is retrieval-augmented generation (RAG) and live fetching. When an AI platform needs to answer a question and its training data is insufficient or potentially stale, it fetches external sources. This is where infrastructure determines outcome.

For RAG citations, three factors drive selection. First, the content must be machine-readable: no JavaScript blocking, clear HTML structure, structured data markup. Second, the content must appear current: dateModified schema, recent publication dates, and references to recent data. Research from Semrush indicates that 95% of ChatGPT citations come from recently updated content. Third, the content must contain specific claims the AI cannot make from memory alone. Original data, proprietary benchmarks, and recent statistics create citation necessity.

One figure captures the scale of this divergence: only 12% of URLs cited by LLMs appear in Google's top 10 results for the same queries. If you are optimizing purely for Google, you are optimizing for a system with only 12% overlap with AI citation behavior.

What Should You Build Instead of Backlinks?

The benchmark data points to three infrastructure investments that directly increase AI citation rates. None of them involve link acquisition.

Answer-first content structure. AI extraction systems scan pages for the first concise, factual statement they can use. If your answer is buried in paragraph 4 behind context-setting, the AI may not reach it, or may retrieve a weaker version of your claim.

The fix is mechanical: move the direct answer to the first 100 words. Use question-based H2 headings that match how users phrase queries to AI. Keep paragraphs under 40 words. Remove qualifying language from opening statements. The opening paragraph of this article is built on this principle. The claim is in sentence one. Every word after it supports and extends that claim.

For a complete guide to this technique, see the AEO guide.

dateModified schema with substantive updates. Pages with Article or TechArticle schema that includes a valid dateModified field receive roughly 1.8x more AI citations than pages without. But the signal only works when backed by real content changes. Updating the date without changing the content is a pattern AI platforms are learning to discount.

The safe approach: update content quarterly with at least 100 words of substantive new material, new statistics, or revised conclusions. Only update dateModified when the change is real. Fake freshness signals have a short shelf life and create downside risk on Google rankings.
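A minimal sketch of generating that markup in Python, using schema.org's TechArticle type with datePublished and dateModified. The function name, headline, and dates are placeholders. Whether you bump the modified date is still the judgment call described above.

```python
import json
from datetime import date


def article_jsonld(headline: str, published: date, modified: date) -> str:
    """Render TechArticle JSON-LD with ISO 8601 dates."""
    if modified < published:
        raise ValueError("dateModified cannot precede datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


# Placeholder values for illustration only.
print(article_jsonld("Example headline", date(2026, 1, 5), date(2026, 4, 10)))
```

The output goes inside a script tag of type application/ld+json in the page head.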

Original data that creates citation necessity. AI has internalized most widely available information from training. When AI encounters a question where its training data runs out, it fetches. Original data forces fetching because the AI has no other source for it.

The table above is an example. The specific DA-versus-citation data from this 7-site audit exists only here. When AI references it, it must cite this source. That is the mechanism. Publish data no one else has published, and AI must come to you for it.

Pages with inline statistics receive 40% more AI citations on average. Benchmark tables, audit results, survey data, and comparison analyses all qualify. One piece of original research per month creates sustained citation opportunities that no backlink campaign can replicate.

Does SEO Still Matter?

Yes, with a precise qualification. Google AI Overviews show 76% overlap with traditional top 10 search results. If you want to appear in Google AI Overviews, traditional SEO still applies. High DA still helps with that specific product.

But for standalone AI platforms, primarily ChatGPT and Perplexity, the 12% divergence means Google optimization is largely orthogonal to AI citation. You need both strategies, and they require different optimization layers.

The good news: the infrastructure changes that improve AI citability also strengthen traditional SEO in parallel. Answer-first content improves featured snippet eligibility. Structured data enables rich results. Content freshness signals help for queries that trigger Google's freshness algorithm. The overlap is real, even if the primary ranking factors diverge.

The mistake is assuming that building backlinks alone will carry you into AI citations. It will not. The game has changed. A new site with DA under 10 and the right content structure outperforms DA 92 on AI citations. That is not an anomaly. It is the new default.

Where to Start

If you have been allocating budget to link building with the assumption it will help AI visibility, here is a more direct path.

Run a free infrastructure scan at citability.dev/assess. It checks 10 baseline signals in under 60 seconds: robots.txt, sitemap, structured data, answer-first content, freshness signals, and more. The scan tells you exactly where your site falls short and which fixes will have the highest impact.

Then read the full benchmark breakdown in I Audited 7 Websites for AI Citability, which walks through each site's specific failures and what was done to improve the results.

Domain authority was a useful shorthand for Google trustworthiness. It is not a shorthand for AI trustworthiness. The infrastructure that makes AI cite you is different, measurable, and largely within your control right now.

Sources

Domain Authority - Moz (Moz)
AI Visibility Readiness Framework (GitHub)
Semrush: How AI Evaluates Content Freshness (Semrush)
Schema.org - dateModified (Schema.org)

What Actually Predicts Whether AI Cites Your Website (Data from 7 Site Audits)

Chudi Nnorukam — Fri, 10 Apr 2026 23:51:17 +0000

Originally published at chudi.dev

Domain authority does not predict whether AI will cite your website. I audited 7 websites for AI citability, and the results challenge nearly everything the SEO industry assumes about AI search visibility.

Ahrefs (DA 92) was cited by AI only 5% of the time despite 100% visibility. A brand-new site with DA under 10 achieved a 15% citation rate. Sites with millions of daily visitors failed basic infrastructure checks. The factors that actually predicted citations had nothing to do with backlinks or traffic.

Here is what the data showed.

TL;DR

AI citability is whether AI answer engines include your URL as a source, not just mention your brand.

Domain authority has zero correlation with AI citation rates
Ahrefs (DA 92) is 100% AI-visible but only 5% cited
citability.dev (DA under 10) achieved 15% citation rate, outperforming DA 90+ sites
Reddit, Medium, and X all failed basic AI infrastructure checks
The three strongest predictors: answer-first content, dateModified schema, original data
Only 12% of URLs cited by LLMs appear in Google's top 10 results

The Audit: 7 Sites, 3 AI Platforms, 10 Infrastructure Checks

I used the AI Visibility Readiness (AVR) framework to run infrastructure audits on 7 websites. Each site was checked for 10 signals that AI crawlers use to discover and parse content: robots.txt, sitemap.xml, answer-first content, content freshness, structured data (JSON-LD), meta descriptions, canonical URLs, HTTPS, heading hierarchy, and social sharing readiness.

Then I queried ChatGPT, Perplexity, and Claude with questions each site should be able to answer. I tracked two metrics:

AI Visibility: Does the AI mention the brand when asked?
AI Citability: Does the AI include a URL from the site as a cited source?
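One way to turn those two definitions into numbers, as a Python sketch. The scoring rule (share of answers that mention the brand, share that cite a URL on the domain) is an assumption, since the audit's exact method is not spelled out here. The sample answers are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
from urllib.parse import urlparse


@dataclass
class Answer:
    mentions_brand: bool
    cited_urls: List[str] = field(default_factory=list)


def cites_domain(url: str, domain: str) -> bool:
    """True when url is hosted on domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def rates(answers: List[Answer], domain: str) -> Tuple[float, float]:
    """Return (visibility, citability) as fractions of all answers."""
    if not answers:
        return (0.0, 0.0)
    visible = sum(1 for a in answers if a.mentions_brand)
    cited = sum(1 for a in answers if any(cites_domain(u, domain) for u in a.cited_urls))
    return (visible / len(answers), cited / len(answers))


# Invented results for four queries.
sample = [
    Answer(True, ["https://example.com/guide"]),
    Answer(True, ["https://other.org/post"]),
    Answer(False, []),
    Answer(False, ["https://other.org/list"]),
]
print(rates(sample, "example.com"))  # (0.5, 0.25)
```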

The Results

Site	Domain Authority	AI Infrastructure	AI Visibility	AI Citability
ahrefs.com	92	Foundation-ready	100%	5%
semrush.com	91	Foundation-ready	Partial	Partial
chudi.dev	28	Foundation-strong	25%	0%
citability.dev	Under 10	Foundation-strong	44%	15%
reddit.com	97	Not ready	Untested	Untested
medium.com	95	Not ready	Untested	Untested
x.com	96	Not ready	Untested	Untested

The three highest-DA sites (Reddit 97, X 96, Medium 95) all failed basic infrastructure readiness. They are missing structured data, answer-first content, or proper AI crawler permissions. These sites get cited constantly by AI, but not because of their infrastructure. They get cited because AI training data includes their content at massive scale.

The most striking result: citability.dev, a site with DA under 10 and fewer than 100 backlinks, achieved a 15% citation rate. That is 3x higher than Ahrefs (DA 92). The difference is not authority. The difference is original benchmark data and answer-first content structure.

For sites without that training-data footprint, infrastructure is the gate.

Does High Domain Authority Mean AI Will Cite You?

No. The data is clear: DA has zero predictive power for AI citations.

Ahrefs has a DA of 92, one of the highest in the SEO industry. Every AI platform recognizes the brand instantly. Ask ChatGPT "what is Ahrefs?" and you get a detailed, accurate answer. That is 100% AI visibility.

But ask ChatGPT "what tools should I use for keyword research?" and Ahrefs gets mentioned but rarely linked. The AI knows the brand exists. It does not need to cite the source. That is the visibility-citation gap, and it exists because AI systems already have the information internalized from training data.

Citation happens when AI needs your content as a source for a specific claim. That requires your content to be structured in a way the AI can extract and attribute.

What Infrastructure Do AI Crawlers Actually Need?

The 10-check audit revealed a clear pattern. Sites that passed 8+ infrastructure checks had measurably higher visibility scores. Sites that failed basic checks were invisible regardless of their authority.

The Baseline Signals

robots.txt and sitemap.xml are table stakes. Every site in the audit had these, but the content of each matters. Reddit's robots.txt blocks several AI crawlers. Medium's sitemap is auto-generated but does not include all content pages. Simply having the files is not enough.

HTTPS and canonical URLs are similarly baseline. Every audited site passed these. They are necessary but not differentiating.

The Differentiating Signals

Three signals separated the visible sites from the invisible ones:

Answer-first content. Pages that led with a direct answer in the first 100 words scored dramatically higher on AI extractability. This matches research showing AI systems extract the first clear, unqualified statement they find on a page. Generic marketing copy, hero images, and navigation-heavy layouts all push the answer down, making it harder for AI to extract.

Structured data (JSON-LD). Sites with Article, FAQPage, and HowTo schema gave AI systems explicit context about content purpose and structure. The chudi.dev audit showed 9 schema types across pages, including TechArticle with dateModified, FAQPage with 5+ questions per article, and Person schema with expertise signals. This machine-readable layer is what lets AI systems understand your content without parsing ambiguous HTML.
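For reference, FAQPage markup follows a fixed shape in schema.org: a mainEntity list of Question items, each with an acceptedAnswer. Here is a Python sketch that builds it from question and answer pairs. The function name and sample text are placeholders.

```python
import json
from typing import List, Tuple


def faq_jsonld(pairs: List[Tuple[str, str]]) -> str:
    """Render FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


# Placeholder content for illustration.
print(faq_jsonld([("What is AI citability?", "Whether an AI answer includes your URL as a source.")]))
```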

Content freshness. Pages with dateModified in their schema received 1.8x more AI citations than pages without, according to Semrush research. This aligns with another finding: 95% of ChatGPT citations come from recently published or updated content. Stale content without date signals gets deprioritized.

Which Sites Get Cited vs Just Mentioned?

The gap between being mentioned and being cited is the central problem in AI visibility.

Platform-Specific Citation Behavior

Each AI platform has different citation preferences:

Perplexity cites approximately 6.6 sources per answer and heavily indexes Reddit (46.7% of its top cited sources)
ChatGPT cites only about 2.6 sources per answer and shows strong Wikipedia preference (7.8% of all citations)
Google Gemini cites about 6.1 sources per answer with 76% overlap with Google's traditional top 10

This means the optimization strategy differs by platform. Perplexity rewards breadth of presence across forums and communities. ChatGPT rewards being on established reference sources. Google AI Overviews still correlates heavily with traditional SEO rankings.

The 12% Divergence

Only 12% of URLs cited by LLMs appear in Google's top 10 search results for the same queries. This is the statistic that should reframe how you think about AI search: ranking on Google and getting cited by AI are largely separate problems.
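If you want to measure the same kind of overlap for your own queries, the calculation is a set intersection. This Python sketch assumes you have already collected the cited URLs and the top 10 results yourself. It does not reproduce the methodology behind the 12% figure, and the URLs are invented.

```python
from typing import Iterable


def overlap_share(cited_urls: Iterable[str], top10_urls: Iterable[str]) -> float:
    """Fraction of distinct cited URLs that also appear in the top 10."""
    cited = set(cited_urls)
    if not cited:
        return 0.0
    return len(cited & set(top10_urls)) / len(cited)


cited = ["https://a.example/1", "https://b.example/2", "https://c.example/3", "https://d.example/4"]
top10 = ["https://a.example/1", "https://z.example/9"]
print(overlap_share(cited, top10))  # 0.25
```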

The exception is Google AI Overviews, which shows 76% overlap with traditional rankings. But ChatGPT and Perplexity operate on fundamentally different source selection algorithms.

The Three Factors That Actually Predict AI Citations

Based on the audit data and corroborating research, three factors had the strongest predictive power:

1. Answer-First Content Structure

Pages where the direct answer appears in the first 100 words get extracted more often. This means:

Lead with the answer, not the question
Keep opening paragraphs to 25-40 words
Use clear, factual statements without qualifying language
Structure H2 headings as questions the reader would ask AI

The qualifying language point is critical. Phrases like "it depends," "in many cases," or "it can be argued" signal uncertainty. AI systems prefer definitive statements they can extract as answers.

2. dateModified Schema with Substantive Updates

The 1.8x citation lift from dateModified schema is real, but only when paired with actual content updates. Google penalizes fake freshness signals, meaning you cannot just bump the date without changing anything. The safe approach:

Update content quarterly with new data and statistics
Add at least 100 words of substantive new content per refresh
Reference current-year sources and data points
Only update dateModified when the refresh is genuine

3. Inline Statistics and Original Data

Pages with inline statistics get 40%+ more AI citations. This makes sense: AI systems need claims they can attribute, and specific numbers are the easiest claims to attribute to a source.

Original data is even more powerful. If your page contains data that does not exist elsewhere, AI has no choice but to cite you when referencing it. This is why I publish audit results and benchmark data publicly. The comparison table at the top of this article is data that exists nowhere else.

What This Means for Your Site

The path from invisible to cited is not about building more backlinks or increasing your DA. It is about making your content technically extractable by AI systems.

The checklist is short:

Check your infrastructure. Run a free scan to verify the 10 baseline signals.
Restructure your content. Lead with answers. Use question-based headings. Add FAQ and HowTo schema.
Publish original data. Give AI systems something they can only get from you.
Keep content fresh. Update quarterly with substantive changes and current statistics.
Test across platforms. Query ChatGPT, Perplexity, and Claude with questions your site should answer. Track citation rates over time.

The sites that get cited in 2026 will not be the ones with the highest DA. They will be the ones whose content is structured so AI systems can extract, trust, and attribute it.

Sources

Evergreen Media: Answer Engine Optimization (Evergreen Media)
Semrush: Answer Engine Optimization Guide (Semrush)
AI Visibility Readiness Framework (citability.dev)

How to Structure Content So AI Actually Cites Your URL (Technical Guide)

Chudi Nnorukam — Fri, 10 Apr 2026 23:50:46 +0000

Originally published at chudi.dev

AI answer engines do not extract content the same way Google indexes it. Getting cited requires specific structural patterns in your HTML, your schema markup, and even your sentence construction. This guide covers each pattern with implementation details.

The core principle: AI systems scan your page top-down and extract the first clear, attributable claim they find. Everything that delays or obscures that claim reduces your citation probability.

TL;DR

Place the direct answer in the first 100 words
Use question-based H2 headings matching what users ask AI
Write 25-40 word paragraphs with inline statistics
Add FAQPage, HowTo, and Article JSON-LD schema
Update dateModified quarterly with 100+ words of real changes
Remove qualifying language that signals uncertainty

Why Does Content Structure Matter for AI Citations?

Google reads your entire page, follows links, and uses PageRank to determine authority. AI answer engines work differently. They scan for extractable claims they can include in a response and attribute to a source.

This means two pages with identical information can have completely different citation rates. The page that structures its content for extraction gets cited. The page that buries the same information below navigation, marketing copy, or lengthy preambles gets skipped.

The structural patterns below are not theoretical. They are derived from audit data across 7 websites and corroborated by Semrush and Evergreen Media research on AI citation behavior.

How Should I Write the First 100 Words?

The first 100 words of your page determine whether AI extracts your content. This is the highest-impact structural change you can make.

The rule: State the direct answer to your page's primary question in the first sentence or paragraph. No preamble. No credentials. No "in this article, we will explore." The answer.

Why it works: AI systems process pages sequentially. The first clear, unqualified factual statement on the page becomes the primary extraction candidate. If your answer appears in paragraph 4 after context-setting, the AI may have already found a better source.

What to remove from your introduction:

"In this article, we will..." framing
Author credentials or company background
Statistics about the topic's importance
Rhetorical questions

What to keep:

The direct answer to the page topic
One supporting data point
A clear statement with no qualifying language

Compare these two openings:

Before (low extractability): "With the rapid growth of AI-powered search engines, many website owners are wondering how to optimize their content. In this comprehensive guide, we will explore the key factors that determine whether AI systems cite your website."

After (high extractability): "AI answer engines cite pages that place a direct answer in the first 100 words, use question-based headings, and include inline statistics with attribution. Pages that bury answers below marketing copy get skipped regardless of their domain authority."

The second version contains three extractable claims in two sentences. The first version contains zero extractable claims in two sentences.

What Heading Structure Do AI Systems Parse?

H2 headings serve as section-level extraction boundaries. AI systems use them to identify which part of the page answers which question. The optimal structure uses questions as headings because they match the exact queries users type into AI platforms.

Why Questions Work Better Than Statements

When a user asks Perplexity "how do I structure content for AI citations?", the AI scans pages for headings that match that query pattern. A heading like "Content Structure Best Practices" is a weak match. A heading like "How Should I Structure Content for AI Citations?" is a direct match.

Question-based headings create a one-to-one mapping between user queries and your content sections. Each H2 becomes a potential extraction point for a specific query.

The Heading Hierarchy

H1: Page title (one per page, states the topic)
H2: Major questions the page answers (5-8 per article)
H3: Sub-questions or supporting points within each H2 section
H4: Implementation details or examples (use sparingly)

Each H2 section should be self-contained. If AI extracts just that section, it should make sense without reading the rest of the page.
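The hierarchy above can be checked mechanically. The sketch below uses Python's standard `html.parser` to collect h1-h4 headings and report how many H1s exist and which H2s are not phrased as questions. The class and function names are hypothetical, and "ends with a question mark" is a simplifying assumption about what counts as a question heading.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1-h4 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4"):
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None

def audit_headings(html):
    """Summarize H1 count, H2 count, and H2s that are not phrased as questions."""
    parser = HeadingCollector()
    parser.feed(html)
    h2s = [text for level, text in parser.headings if level == 2]
    return {
        "h1_count": sum(1 for level, _ in parser.headings if level == 1),
        "h2_count": len(h2s),
        "h2_not_questions": [t for t in h2s if not t.endswith("?")],
    }
```

A page that follows the hierarchy returns `h1_count` of 1, an `h2_count` between 5 and 8, and an empty `h2_not_questions` list.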

How Long Should Paragraphs Be for AI Extraction?

Keep paragraphs to 25-40 words. Each paragraph should contain exactly one claim.

AI systems evaluate individual paragraphs as extraction candidates. A 150-word paragraph containing four different claims forces the AI to parse and separate ideas. A 30-word paragraph containing one clear claim is ready to extract immediately.

Short paragraphs also improve citation attribution. When AI extracts a single claim from a single paragraph, it can confidently attribute that claim to your page. When it extracts a claim from a dense paragraph with multiple ideas, the attribution is less certain, and the AI may choose a cleaner source instead.

This pattern applies to statistics especially. Instead of embedding a number in a long paragraph, give it its own sentence:

Weak: "There are many factors that affect AI citations, and according to recent research, pages with inline statistics tend to perform about 40% better than pages without them, though results may vary."

Strong: "Pages with inline statistics get 40% more AI citations than pages without them."

The strong version is 13 words and one claim. It is extractable, attributable, and unambiguous.
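Paragraph length is easy to audit in bulk. The sketch below assumes paragraphs are separated by blank lines and flags any paragraph over the 40-word ceiling. The function name and the blank-line convention are assumptions for this example; adapt the splitting to your own content format.

```python
def paragraph_report(text, high=40):
    """Split on blank lines and report the word count of each paragraph."""
    report = []
    for block in text.split("\n\n"):
        words = len(block.split())
        if words == 0:
            continue  # skip empty blocks caused by extra blank lines
        report.append({
            "words": words,
            "too_long": words > high,
            "preview": block.strip()[:40],
        })
    return report
```

Word count is only a proxy. A 30-word paragraph can still carry two claims, so treat the flagged paragraphs as a review queue, not a verdict.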

What Structured Data Should I Add?

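JSON-LD schema gives AI systems a machine-readable layer that does not depend on your page layout. It sits in a script tag in the HTML, so a parser can read it without interpreting the visible markup. Three schema types cover most content patterns AI platforms look for.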

FAQPage Schema

FAQPage schema wraps question-answer pairs in a format AI can extract without parsing your page layout. Each question becomes a structured extraction point.

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What content structure gets the most AI citations?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Answer-first structure where the direct answer appears in the first 100 words, followed by supporting evidence."
      }
    }
  ]
}

Add FAQPage schema to any page with 3 or more question-answer patterns. Your FAQ frontmatter or Q&A sections are natural candidates.
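If your Q&A pairs already live in frontmatter or a CMS field, the schema can be generated instead of hand-written. This is a minimal sketch using only the standard `json` module. The function names `faq_jsonld` and `as_script_tag` are hypothetical, and the input shape (a list of question-answer tuples) is an assumption about how your content is stored.

```python
import json

def faq_jsonld(pairs):
    """Build a FAQPage JSON-LD document from (question, answer) tuples."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

def as_script_tag(doc):
    """Serialize a JSON-LD dict into a script tag for the page head or body."""
    # Escape "</" so answer text cannot close the script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

Generating the block from the same source as the visible Q&A keeps the two in sync, which matters because the structured answer should match what the page actually says.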

HowTo Schema

HowTo schema structures procedural content into numbered steps. AI platforms use this for "how do I..." queries.

{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "How to structure content for AI citations",
  "step": [
    {
      "@type": "HowToStep",
      "position": 1,
      "name": "Write an answer-first introduction",
      "text": "State the direct answer in the first 100 words."
    }
  ]
}

Add HowTo schema to tutorial posts, deployment guides, and any content with sequential steps.
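The same approach works for procedural content. The sketch below builds the HowTo structure shown above from an ordered list of steps and numbers the positions automatically. The function name `howto_jsonld` and the tuple input format are assumptions for illustration.

```python
def howto_jsonld(name, steps):
    """Build a HowTo JSON-LD document from ordered (step_name, step_text) tuples."""
    return {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": name,
        "step": [
            {
                "@type": "HowToStep",
                "position": position,  # 1-based, matching the example above
                "name": step_name,
                "text": step_text,
            }
            for position, (step_name, step_text) in enumerate(steps, start=1)
        ],
    }
```

Deriving `position` from list order means reordering steps in your source never leaves stale numbers in the schema.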

Article Schema with Freshness Signals

Article schema with datePublished and dateModified is the freshness signal AI systems look for. Pages with dateModified schema receive 1.8x more AI citations than pages without it. The example uses TechArticle, a more specific subtype of Article that suits technical posts.

{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "How to Structure Content for AI Citations",
  "datePublished": "2026-04-10",
  "dateModified": "2026-04-10",
  "author": {
    "@type": "Person",
    "name": "Your Name"
  }
}

The critical rule: only update dateModified when you make substantive changes, meaning at least 100 new words, updated statistics, or new sections. Google penalizes fake freshness, and AI systems are learning to detect it.
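The 100-word threshold can be enforced in a publishing script so that typo fixes never touch dateModified. The sketch below is a rough approximation under stated assumptions: it measures net word-count growth only, so it will not detect updated statistics or restructured sections, and the function names are hypothetical.

```python
from datetime import date

def should_bump_date_modified(old_text, new_text, min_new_words=100):
    """Treat an edit as substantive when it adds at least `min_new_words` net words."""
    added = len(new_text.split()) - len(old_text.split())
    return added >= min_new_words

def updated_article(article, old_text, new_text, today):
    """Return a copy of the Article JSON-LD dict, bumping dateModified only on substantive edits."""
    result = dict(article)  # leave the caller's dict untouched
    if should_bump_date_modified(old_text, new_text):
        result["dateModified"] = today.isoformat()
    return result
```

For edits that change numbers without adding words, bump the date manually. The word-count check is a guard against accidental bumps, not a full definition of "substantive."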

What Language Patterns Reduce Citation Probability?

AI systems prefer definitive statements. Qualifying language signals uncertainty and reduces extraction confidence.

Phrases that hurt citations:

"It depends on..." (signals no clear answer)
"In many cases..." (hedging)
"It could be argued that..." (uncertainty)
"Results may vary..." (disclaimer)
"Arguably the best..." (subjective)

Phrases that help citations:

"X produces Y result." (direct claim)
"Pages with X get 40% more Y." (quantified claim)
"The three factors are..." (enumerated answer)
"This works because..." (causal explanation)

This does not mean you should never use nuance. It means your opening statements and H2-level answers should be definitive. Save qualifications for supporting paragraphs where you add context and caveats.

The first sentence under each H2 heading is your primary extraction point. Make it a clear, factual statement.
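Because the opening sentence of each section carries the most weight, it is the one worth linting. The sketch below pulls the first sentence of a section and checks it against the hedging phrases listed above. The sentence splitter is a deliberately simple regex, and the names `HEDGES`, `first_sentence`, and `hedges_in_opening` are assumptions for this example.

```python
import re

# Hedging phrases from the "phrases that hurt citations" list above.
HEDGES = [
    "it depends on",
    "in many cases",
    "it could be argued that",
    "results may vary",
    "arguably",
]

def first_sentence(section_text):
    """Return text up to the first ., ! or ? that is followed by whitespace or end of text."""
    text = section_text.strip()
    match = re.search(r"(.+?[.!?])(\s|$)", text, flags=re.S)
    return match.group(1) if match else text

def hedges_in_opening(section_text):
    """Return the hedging phrases found in the section's first sentence."""
    opening = first_sentence(section_text).lower()
    return [h for h in HEDGES if h in opening]
```

Hedges later in the section are ignored on purpose. That matches the guidance above: qualifications belong in supporting paragraphs, not in the extraction point.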

How Do I Test Whether My Content Is AI-Extractable?

Testing requires querying actual AI platforms with questions your content should answer.

The 20-Query Test

Write 20 questions across four categories:

Brand queries (5): "What is [your brand]?", "Who makes [product]?"
Category queries (5): "What tools do [your category]?", "Best [category] for [use case]?"
Comparison queries (5): "[Your product] vs [competitor]?", "Difference between [X] and [Y]?"
How-to queries (5): "How do I [task your content covers]?", "Steps to [process you explain]?"

Query each across ChatGPT, Perplexity, and Claude. Record one of three outcomes for each query on each platform:

Cited: AI includes your URL as a source
Mentioned: AI references your brand but does not link
Absent: AI does not reference you at all

Your citation rate is the number of queries with a Cited outcome divided by the total number of queries run. Track this monthly after structural changes.
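A small script keeps the monthly tracking consistent. The sketch below assumes each platform-query pair is logged as one observation, which is one reasonable reading of the test but not the only one. The record shape and function names are hypothetical.

```python
from collections import Counter

def citation_rate(results):
    """results: list of (platform, query, outcome) tuples.

    outcome is one of "cited", "mentioned", or "absent".
    Returns the share of observations with a "cited" outcome.
    """
    counts = Counter(outcome for _, _, outcome in results)
    total = sum(counts.values())
    return counts["cited"] / total if total else 0.0

def rate_by_platform(results):
    """Return the citation rate for each platform separately."""
    platforms = sorted({platform for platform, _, _ in results})
    return {
        p: citation_rate([r for r in results if r[0] == p])
        for p in platforms
    }
```

Splitting by platform is worth the extra step. An overall rate can hide the case where one engine cites you consistently and the others never do.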

Infrastructure Pre-Check

Before testing content, verify your infrastructure passes baseline checks. A free scan at citability.dev checks 10 signals: robots.txt, sitemap.xml, answer-first content, freshness, structured data, meta description, canonical URL, HTTPS, heading hierarchy, and social sharing readiness.

If you fail infrastructure checks, content structure improvements will not help. Fix the baseline first.

What Is the Implementation Priority?

Not all changes have equal impact. Here is the priority order based on citation lift data:

1. Answer-first content (highest impact, zero cost): Rewrite introductions on your top 10 pages
2. Structured data (high impact, low effort): Add FAQPage and Article schema with dateModified
3. Heading restructure (medium impact, medium effort): Convert statement headings to question headings
4. Paragraph optimization (medium impact, ongoing): Shorten paragraphs to 25-40 words on new content
5. Language cleanup (lower impact, ongoing): Remove qualifying language from opening statements
6. Freshness cadence (sustained impact, quarterly): Update top pages with substantive new content

Start with items 1 and 2. They produce the largest citation lift with the least effort. Items 3-6 are ongoing improvements you apply to all new content and gradually retrofit into existing pages.

The sites that get cited by AI in 2026 are not the ones with the best writing or the highest authority. They are the ones whose content is technically structured so AI systems can find the answer, extract the claim, and attribute the source.

Sources

Evergreen Media: Answer Engine Optimization (Evergreen Media)
Semrush: Answer Engine Optimization Guide (Semrush)
Google: Structured Data Documentation (Google)
Google: Featured Snippets Best Practices (Google)
AI Visibility Readiness Framework (citability.dev)