DEV Community: Leo Pechnicki

endpoint-tester Now Supports 12 Frameworks — Here's What Changed in Three Weeks

Leo Pechnicki — Tue, 05 May 2026 07:22:46 +0000

Three weeks ago I published the first version of endpoint-tester — a CLI that scans your source code, discovers API routes, and generates a ready-to-run test suite with zero configuration. The initial release shipped with Express.js as the fully working adapter plus scaffolded support for FastAPI and Spring Boot.

Today the tool supports 12 frameworks across four languages — and every one of them is fully implemented, not scaffolded.

Here's what changed, why it mattered, and how to use the new adapters.

The 12 frameworks

The adapter list is now:

Language	Frameworks
Node.js	Express.js, Fastify, Koa, NestJS
Python	FastAPI, Flask, Django
Go	Gin, Echo, Chi, net/http
JVM	Spring Boot

Auto-detection works for all of them. Point the tool at any project and it reads package.json, requirements.txt, go.mod, or pom.xml to pick the right adapter before it scans a single source file.

What the tool does (quick recap)

# Scan — discover every endpoint in your project
npx endpoint-tester scan ./src

# Generate — produce a ready-to-run test file
npx endpoint-tester generate ./src --format vitest --output ./tests/api.test.ts

The generated tests cover:

Success assertions with method-correct status codes (POST → 201, DELETE → 204, GET → 200)
Auth header tests (valid Bearer token, missing token, malformed token)
Error response tests for body-accepting endpoints (missing required field, wrong type)
Boundary value tests for path parameters (empty string, negative integer, nonexistent ID)
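The method-to-status convention in the first item above can be sketched as a small lookup. This is an illustration, not the tool's source: `expectedSuccessStatus` is a hypothetical name, and falling back to 200 for methods the list does not mention is an assumption.

```typescript
// Hypothetical helper: maps an HTTP method to the success status a
// generated test would assert, following the convention listed above.
type HttpMethod = 'GET' | 'POST' | 'PUT' | 'PATCH' | 'DELETE';

function expectedSuccessStatus(method: HttpMethod): number {
  switch (method) {
    case 'POST':
      return 201; // resource created
    case 'DELETE':
      return 204; // success with no response body
    default:
      return 200; // assumption: every other method expects a plain OK
  }
}
```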

The new adapters in detail

Fastify

The Fastify adapter handles both shorthand registrations and the object-style fastify.route() call, which many Fastify projects mix together:

fastify.get('/users', listUsers);
fastify.route({ method: 'POST', url: '/users', handler: createUser });

Both forms produce identical output in the scan — the generated test doesn't care how the route was registered.
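To make the "identical output" point concrete, here is a minimal regex-based sketch that normalizes both registration styles into one shape. It is not the adapter's real code: `parseFastifyRoutes` and `ParsedRoute` are hypothetical names, and the object-form pattern assumes `method` appears before `url`, as in the example above.

```typescript
// Illustrative only: normalizes both Fastify registration styles.
interface ParsedRoute {
  method: string;
  path: string;
}

function parseFastifyRoutes(source: string): ParsedRoute[] {
  const routes: ParsedRoute[] = [];
  let m: RegExpExecArray | null;

  // Shorthand form: fastify.get('/users', handler)
  const shorthand = /fastify\.(get|post|put|patch|delete)\(\s*['"]([^'"]+)['"]/g;
  while ((m = shorthand.exec(source)) !== null) {
    routes.push({ method: m[1].toUpperCase(), path: m[2] });
  }

  // Object form: fastify.route({ method: 'POST', url: '/users', ... })
  // Assumption: method is declared before url inside the object literal.
  const objectForm =
    /fastify\.route\(\s*\{[^}]*?method:\s*['"](\w+)['"][^}]*?url:\s*['"]([^'"]+)['"]/g;
  while ((m = objectForm.exec(source)) !== null) {
    routes.push({ method: m[1].toUpperCase(), path: m[2] });
  }

  return routes;
}
```

Both lines of the example yield the same `{ method, path }` shape, so anything downstream of the parser never needs to know which style was used.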

Koa

Koa's router prefix pattern tripped up earlier attempts at parsing because router.prefix('/api/v1') sits on a different line from the actual method registrations. The adapter resolves prefixes before emitting endpoints, so you get /api/v1/users rather than just /users:

const router = new Router();
router.prefix('/api/v1');
router.get('/users', listUsers);
router.post('/users', createUser);
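A minimal sketch of that prefix resolution, under stated assumptions: one router per file, a literal string prefix, and a variable literally named `router`. `resolveKoaPaths` is a hypothetical name, not part of the tool.

```typescript
// Illustrative sketch of prefix resolution for a single Koa router.
function resolveKoaPaths(source: string): string[] {
  // Find the prefix first, wherever it appears in the file.
  const prefixMatch = /router\.prefix\(\s*['"]([^'"]*)['"]\s*\)/.exec(source);
  const prefix = prefixMatch ? prefixMatch[1] : '';

  const paths: string[] = [];
  const routeRe = /router\.(get|post|put|patch|delete)\(\s*['"]([^'"]+)['"]/g;
  let m: RegExpExecArray | null;
  while ((m = routeRe.exec(source)) !== null) {
    // Join prefix and route path, collapsing repeated slashes.
    const full = `${prefix}/${m[2]}`.replace(/\/{2,}/g, '/');
    paths.push(`${m[1].toUpperCase()} ${full}`);
  }
  return paths;
}
```

Run against the snippet above, this returns `GET /api/v1/users` and `POST /api/v1/users`.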

NestJS

NestJS uses class-level controller decorators to define route prefixes. The adapter combines @Controller('users') with method decorators to reconstruct full paths, and it infers parameter names from @Param(), @Query(), and @Body() to improve boundary test generation:

@Controller('users')
export class UsersController {
  @Get(':id')
  findOne(@Param('id') id: string) { ... }

  @Post()
  create(@Body() createUserDto: CreateUserDto) { ... }
}
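The path reconstruction can be sketched roughly as follows. This is an assumption-laden illustration (one controller per source string, literal string arguments only), and `resolveNestPaths` is a hypothetical name rather than the adapter's API.

```typescript
// Illustrative sketch: combine a class-level @Controller prefix with
// method-level route decorators.
function resolveNestPaths(source: string): string[] {
  const ctrl = /@Controller\(\s*['"]([^'"]*)['"]\s*\)/.exec(source);
  const prefix = ctrl ? ctrl[1] : '';

  const results: string[] = [];
  // Matches @Get(':id'), @Post(), and so on; the path argument is optional.
  const methodRe = /@(Get|Post|Put|Patch|Delete)\(\s*(?:['"]([^'"]*)['"])?\s*\)/g;
  let m: RegExpExecArray | null;
  while ((m = methodRe.exec(source)) !== null) {
    // Drop empty segments so @Post() on 'users' yields /users, not /users/.
    const segments = [prefix, m[2] || ''].filter((s) => s.length > 0);
    results.push(`${m[1].toUpperCase()} /${segments.join('/')}`);
  }
  return results;
}
```

For the controller above, this yields `GET /users/:id` and `POST /users`.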

Go frameworks (Gin, Echo, Chi, net/http)

The Go adapters were the biggest addition. All three major Go HTTP routers use similar registration patterns but differ in their group/prefix APIs:

// Gin
r := gin.Default()
api := r.Group("/api")
api.GET("/users", listUsers)

// Echo
e := echo.New()
g := e.Group("/api")
g.GET("/users", listUsers)

// Chi
r := chi.NewRouter()
r.Route("/api", func(r chi.Router) {
    r.Get("/users", listUsers)
})

The adapters handle all three group/prefix styles and resolve them correctly. net/http support covers http.HandleFunc(), mux.HandleFunc(), and http.Handle() — the stdlib primitives that underpin most Go services that don't pull in a router dependency.
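For the Gin and Echo group style, one plausible approach is to track which variable holds which prefix and prepend it at registration time. The sketch below assumes literal string paths and `:=` assignments; `resolveGoGroupRoutes` is a hypothetical name, and Chi's nested `r.Route(...)` closures would need scope tracking that is not attempted here.

```typescript
// Illustrative sketch for Gin/Echo-style groups parsed from Go source text.
function resolveGoGroupRoutes(source: string): string[] {
  const prefixes = new Map<string, string>();

  // api := r.Group("/api")  ->  prefixes: api => "/api"
  const groupRe = /(\w+)\s*:=\s*(\w+)\.Group\(\s*"([^"]*)"\s*\)/g;
  let g: RegExpExecArray | null;
  while ((g = groupRe.exec(source)) !== null) {
    // Nested groups inherit the prefix of the variable they were created from.
    const parentPrefix = prefixes.get(g[2]) ?? '';
    prefixes.set(g[1], parentPrefix + g[3]);
  }

  // api.GET("/users", handler)  ->  "GET /api/users"
  const routes: string[] = [];
  const routeRe = /(\w+)\.(GET|POST|PUT|PATCH|DELETE)\(\s*"([^"]*)"/g;
  let r: RegExpExecArray | null;
  while ((r = routeRe.exec(source)) !== null) {
    const prefix = prefixes.get(r[1]) ?? '';
    routes.push(`${r[2]} ${prefix + r[3]}`);
  }
  return routes;
}
```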

Auto-detection under the hood

The detection logic reads manifest files first, then falls back to import scanning if the manifest is ambiguous:

package.json dependencies → identifies Node.js framework
requirements.txt / pyproject.toml → identifies Python framework
go.mod require directives → identifies Go framework (github.com/gin-gonic/gin, github.com/labstack/echo, etc.)
pom.xml / build.gradle → identifies Spring Boot

Confidence is returned alongside the framework name. If confidence is low — for example, a project has both Flask and FastAPI listed as dependencies — the CLI warns you and lets you override with --framework.
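The ambiguity rule can be sketched as follows. Everything here is an assumption for illustration: the package-to-framework table, the two-level `'high' | 'low'` confidence, and the `detectFromDependencies` name are not taken from the tool's source.

```typescript
// Illustrative sketch of manifest-based detection with a confidence flag.
type Confidence = 'high' | 'low';

interface Detection {
  framework: string | null;
  confidence: Confidence;
  candidates: string[];
}

// Hypothetical lookup table from dependency name to framework id.
const KNOWN_PACKAGES = new Map<string, string>([
  ['express', 'express'],
  ['fastify', 'fastify'],
  ['koa', 'koa'],
  ['@nestjs/core', 'nestjs'],
  ['fastapi', 'fastapi'],
  ['flask', 'flask'],
  ['django', 'django'],
]);

function detectFromDependencies(dependencies: string[]): Detection {
  const candidates: string[] = [];
  for (const dep of dependencies) {
    const framework = KNOWN_PACKAGES.get(dep.toLowerCase());
    if (framework !== undefined && candidates.indexOf(framework) === -1) {
      candidates.push(framework);
    }
  }

  // Exactly one match is unambiguous. Zero or several matches means the
  // caller should warn and accept an explicit override.
  return {
    framework: candidates.length > 0 ? candidates[0] : null,
    confidence: candidates.length === 1 ? 'high' : 'low',
    candidates,
  };
}
```

With `['flask', 'fastapi']` as input the result carries both candidates and `'low'` confidence, which is the case where the CLI's `--framework` override applies.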

Programmatic API

If you're integrating the tool into a CI pipeline or build script, the programmatic API gives you full control:

import { Scanner, TestGenerator, detectFramework, getAdapter } from 'endpoint-tester';

const detected = await detectFramework('./src');
const adapter = getAdapter(detected.framework);

const scanner = new Scanner(adapter);
const endpoints = await scanner.scan({ directory: './src', framework: detected.framework });

const generator = new TestGenerator();
generator.generate({
  endpoints,
  output: './tests',
  format: 'vitest',
  baseUrl: 'http://localhost:3000',
});

The Adapter interface is also exported — implement it to add any framework the built-in list doesn't cover:

import { Adapter, Endpoint, Framework, registerAdapter } from 'endpoint-tester';

class HonoAdapter implements Adapter {
  framework = 'hono' as Framework;
  fileExtensions = ['.ts', '.js'];

  parse(source: string, filePath?: string): Endpoint[] {
    // regex-based parsing logic
    return [];
  }
}

registerAdapter(new HonoAdapter());

What's next

The contributing guide lists the highest-impact open areas:

OpenAPI/Swagger output — generate a spec file instead of (or alongside) a test suite
Watch mode — re-scan and regenerate on file change, useful during active development
Smarter body inference — use TypeScript types and Python type hints to generate more precise field-level tests
More frameworks — Hono, Actix (Rust), Laravel (PHP) are the most-requested

Pull requests are welcome. The adapter interface is simple on purpose — a new adapter is usually 60–100 lines of regex parsing.

Install and try it

npm install -g endpoint-tester

# or without installing:
npx endpoint-tester scan ./src

GitHub: github.com/leopechnicki/endpoint-tester

If the tool is useful, a star on GitHub helps others find it. And if you're using a framework that isn't on the list yet, open an issue — or better, open a PR.

The Cryptographic Cliff: Post-Quantum Migration at Scale

Leo Pechnicki — Fri, 24 Apr 2026 23:51:07 +0000

The Clock Is Already Running

On August 13, 2024, the U.S. National Institute of Standards and Technology published three finalized post-quantum cryptography (PQC) standards: FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), and FIPS 205 (SLH-DSA). This capped an eight-year standardization process that began in 2016. The standards exist. The algorithms are proven. The migration path is documented.

So why is almost no one doing it?

The honest answer is not technical. The standards arrived ahead of institutional capacity, not ahead of institutional need. The enemy is not a missing algorithm — it is a systematic incentive failure compounded by legacy lock-in, regulatory fragmentation, and a threat that is catastrophically non-linear: one day the risk is theoretical, the next day your encrypted archives from 2019 are legible to an adversary. There is no gradual onset. There is no warning shot.

This article makes the case that the migration window is narrower than it appears, that damage is accumulating right now through Harvest-Now, Decrypt-Later (HNDL) operations, and that the organizations most exposed — large financial institutions, government contractors, critical infrastructure operators — are also the ones least structurally capable of moving fast.

Part I: The Quantum Compute Timeline — What We Actually Know

Understanding the threat requires disentangling hype from engineering reality.

Where the Hardware Stands

Google's Willow chip, announced in a Nature paper on December 9, 2024, is the most discussed recent milestone. Willow runs on 105 physical qubits and demonstrated exponential error suppression as qubit count scaled — the first time a quantum processor cleared the "below threshold" bar for quantum error correction on a meaningful benchmark. It also performed a synthetic computation in under five minutes that would take a classical supercomputer 10 septillion (10²⁵) years.

That headline obscures the crucial caveat: Willow is not a cryptographically relevant quantum computer (CRQC). Factoring RSA-2048 using Shor's algorithm requires not just many qubits, but fault-tolerant logical qubits — a category Willow does not occupy. Google itself has stated that a CRQC remains "years away."

IBM's roadmap is more structured and arguably more credible as a timeline signal. Their published path targets the Quantum Starling system by 2029: 200 logical qubits capable of executing over 100 million quantum operations. A successor, "Blue Jay," is planned for 2033 at roughly 2,000 logical qubits (~100,000 physical). IBM is also delivering Nighthawk and Loon in 2025 as architectural stepping stones toward quantum error correction using LDPC codes.

The Physical Qubit Floor Is Dropping Fast

For years, the canonical estimate to break RSA-2048 was roughly 20 million physical qubits (Gidney & Ekerå, 2021). Recent research has revised that number sharply downward. A 2025 paper from Google Quantum AI suggests fewer than one million noisy qubits could suffice using more efficient circuit constructions. Another research group, using LDPC codes rather than surface codes, published estimates below 100,000 physical qubits, a reduction of more than two orders of magnitude from the 2021 baseline.

This trajectory matters. The logical qubit count required — roughly 1,400 to 1,730 by current estimates — is stable. What is collapsing is the physical qubit overhead needed to implement those logical qubits reliably. As error correction improves, the hardware threshold for a CRQC falls. The window between "this is theoretical" and "this is urgent" compresses non-linearly.

Q-Day: Not a Date, a Distribution

Experts almost universally reject claims of a specific Q-Day date. The realistic consensus clusters at: a 5–10% probability of a CRQC by 2030, rising to 50%+ in the 2035–2040 range, with some credible scenarios extending to 2050. But this probability distribution is not symmetric. A single algorithmic breakthrough — equivalent in magnitude to what LDPC codes did to the physical qubit estimate — could compress that distribution toward the near end faster than any institutional migration can respond.

The NSA's guidance in CNSA 2.0 requires National Security Systems to be fully quantum-resistant by 2035. The EU's quantum roadmap mandates that high-risk financial systems complete PQC transition by 2030. These are not aspirational targets — they are bureaucratic acknowledgments that the physics is closing in.

Part II: The Standards That Exist and What They Actually Do

FIPS 203 — ML-KEM (Module-Lattice Key Encapsulation Mechanism)

ML-KEM, derived from CRYSTALS-KYBER, is the primary replacement for RSA and Diffie-Hellman in key exchange. It operates on module lattice problems — specifically the Module Learning With Errors (MLWE) hardness assumption. Security levels map to ML-KEM-512 (~AES-128), ML-KEM-768 (~AES-192), and ML-KEM-1024 (~AES-256).

ML-KEM is already shipping in production. Chrome 131 (November 2024) switched from the experimental Kyber draft to the finalized ML-KEM, deploying the hybrid X25519MLKEM768 key exchange by default across Chrome's global user base. Cloudflare reported that by March 2025, over a third of human HTTPS traffic on its network used hybrid post-quantum handshakes. This is not a pilot — it is mass deployment.

FIPS 204 — ML-DSA (Module-Lattice Digital Signature)

ML-DSA, derived from CRYSTALS-Dilithium, replaces RSA and ECDSA for digital signatures. It is the algorithm most critical for code signing, certificate issuance, and authentication workflows. Key and signature sizes are larger than classical alternatives: ML-DSA-44 (the ~128-bit security variant) produces 1,312-byte public keys and 2,420-byte signatures, versus ECDSA P-256's 64-byte signatures. This size increase is not trivial in constrained environments.

FIPS 205 — SLH-DSA (Stateless Hash-Based Digital Signature)

SLH-DSA, derived from SPHINCS+, is the conservative backup signature scheme. Its security rests entirely on hash function security — no new mathematical assumptions. Trade-off: significantly larger signatures (7,856 bytes at SL-1) and slower signing. SLH-DSA is appropriate where conservative security assumptions are paramount (e.g., root CAs, firmware signing).

FIPS 206 — FN-DSA (coming)

FALCON, now being standardized as FN-DSA in FIPS 206, offers significantly smaller signatures than ML-DSA (666 bytes at Level 1) making it attractive for IoT and constrained hardware, at the cost of implementation complexity and sampler-timing attack risk.

NIST additionally selected HQC as a backup KEM for standardization in March 2025 — a code-based alternative providing algorithmic diversity should lattice problems be broken.

Part III: Engineering Reality — What Migration Actually Looks Like

The Hidden Scale of Cryptographic Surface Area

The first obstacle any organization faces is discovery. Almost universally, enterprises find 3–5× more cryptographic assets than they estimated when they begin formal inventory. TLS certificates in load balancers, embedded key pairs in IoT firmware, HSM-pinned RSA keys in payment terminals, hardcoded algorithm identifiers in COBOL batch processes — these are not tracked in any CMDB, and they do not break audibly when they fail.

The U.S. government's own July 2024 report estimated the total federal migration cost at $7.1 billion over ten years (in 2024 dollars). Private-sector migration at aggregate scale is expected to be considerably higher, and unlike federal agencies, enterprises face no statutory mandate with real enforcement teeth.

Crypto-Agility: The Concept Organizations Claim to Have But Don't

Crypto-agility — the capacity to swap cryptographic algorithms across a system without rebuilding core infrastructure — is universally acknowledged as the correct architectural posture. It is almost universally absent in production systems.

Legacy TLS stacks, particularly pre-TLS 1.3 deployments, hardcode algorithm identifiers at the cipher suite level. HSM firmware must be updated or replaced to support new key types. PKI trust chains are built on certificate templates that encode specific algorithm parameters. Payment terminals running TLS 1.2 against pinned leaf certificates do not gracefully negotiate ML-KEM key exchange. The remediation path for these systems is not a config change — it is a hardware refresh cycle that takes 3–5 years minimum.

The NIST NCCoE has published detailed PQC migration practice guides specifically addressing these bottlenecks, but guides do not move legacy firmware.

The TLS Handshake Migration Problem

The concrete engineering challenge for TLS is well-understood. A TLS 1.3 handshake with ML-KEM-768+X25519 (hybrid mode) increases the initial ClientHello flight significantly — the ML-KEM public key alone is 1,184 bytes versus 32 bytes for X25519. In environments with strict MTU constraints, fragmentation behavior changes. Load balancers that terminate TLS must understand the new algorithm identifiers; those that don't will either fail closed (breaking connections) or fail open (falling back to classical crypto, defeating the purpose).
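The size pressure is easy to check with arithmetic. The sketch below uses the 1,184-byte and 32-byte figures from the paragraph above plus the 1,088-byte ML-KEM-768 ciphertext size from FIPS 203; the 1,500-byte Ethernet MTU is an assumption about the network, not something the article specifies.

```typescript
// Key-share sizes for a hybrid X25519 + ML-KEM-768 exchange, in bytes.
const X25519_PUBLIC_KEY = 32;
const MLKEM768_ENCAPSULATION_KEY = 1184; // sent by the client
const MLKEM768_CIPHERTEXT = 1088; // returned by the server

const clientShare = MLKEM768_ENCAPSULATION_KEY + X25519_PUBLIC_KEY; // 1216
const serverShare = MLKEM768_CIPHERTEXT + X25519_PUBLIC_KEY; // 1120

// Assumption: a 1,500-byte Ethernet MTU. Everything else in the
// ClientHello (SNI, cipher suites, other extensions) plus the IP and TCP
// headers must fit in the remainder to stay within a single packet.
const ETHERNET_MTU = 1500;
const remainingBudget = ETHERNET_MTU - clientShare; // 284
```

A classical X25519-only share leaves 1,468 bytes of that same budget, which is why the hybrid ClientHello is far more likely to span two packets.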

The hybrid approach — running classical and post-quantum algorithms in parallel, deriving shared secrets from both — is the safe migration path because it maintains classical security guarantees while adding quantum resistance. AWS, Cloudflare, and Google Cloud all support hybrid PQC TLS in 2025. The enterprise middleware between those cloud edges and internal applications frequently does not.

Timeline Reality Check

Migration timelines by organization size:

Small enterprises: 5–7 years for complete PQC migration
Medium enterprises: 8–12 years
Large enterprises (banks, utilities, government contractors): 12–15+ years

If large enterprises need 12–15 years and NIST standards were finalized in August 2024, the math is unflinching: organizations that started in 2024 may not complete before 2036–2039. The EU mandates financial sector PQC completion by 2030. The U.S. mandates NSS completion by 2035. The timelines and the institutional capacity are structurally misaligned.
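The misalignment is simple arithmetic, sketched here for a 2024 start with the 12 to 15 year range for large enterprises. The function names are hypothetical and exist only for this illustration.

```typescript
// Earliest and latest completion year for a migration of a given duration.
function completionWindow(
  startYear: number,
  minYears: number,
  maxYears: number,
): [number, number] {
  return [startYear + minYears, startYear + maxYears];
}

// Years by which a completion year overshoots a deadline (0 if it is met).
function overshoot(completionYear: number, deadlineYear: number): number {
  return Math.max(0, completionYear - deadlineYear);
}

const [earliest, latest] = completionWindow(2024, 12, 15); // [2036, 2039]
const missEuBestCase = overshoot(earliest, 2030); // 6 years late at best
const missNssBestCase = overshoot(earliest, 2035); // 1 year late at best
const missNssWorstCase = overshoot(latest, 2035); // 4 years late at worst
```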

Part IV: The Threat That Won't Wait — HNDL Operations

The Harvest Is Already Underway

Harvest-Now, Decrypt-Later is not a hypothetical future attack — it is a present-tense operation. The strategy is straightforward: intercept and store encrypted traffic today; decrypt it when quantum capability arrives. Nation-state actors do not need a CRQC to begin the collection phase. They need only storage and access.

The U.S. DHS, UK NCSC, ENISA, and Australian Cyber Security Centre have all published guidance explicitly premised on the assumption that adversaries are currently exfiltrating and archiving sensitive, long-lived encrypted data. This is not a theoretical warning; it is a statement of operational intelligence consensus.

The data most at risk is not what is encrypted today with weak algorithms. It is data that has a long confidentiality shelf life: diplomatic cables, trade negotiations, weapons systems documentation, proprietary financial algorithms, patient health records, and merger & acquisition communications. The Federal Reserve has published direct research on HNDL risk to distributed ledger networks. This is financial infrastructure research, not academic speculation.

Why HNDL Breaks the Standard Threat Model

Traditional cryptographic threat models assume that an adversary must compromise the system at the time of the data's sensitivity. HNDL invalidates this temporal boundary. Data encrypted in 2020 with RSA-2048 and classified confidential for 20 years is now under threat of decryption by 2030–2035. The confidentiality window and the quantum compute timeline overlap.
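The overlap can be written as a one-line check. The second function is the form usually attributed to Michele Mosca (shelf life plus migration time compared against time until a CRQC); the function names and the sample years are illustrative assumptions, not figures beyond those already quoted above.

```typescript
// Data is exposed if it must stay confidential past the year a CRQC arrives.
function isExposedToHndl(
  encryptedYear: number,
  shelfLifeYears: number,
  crqcYear: number,
): boolean {
  return encryptedYear + shelfLifeYears > crqcYear;
}

// Mosca-style check: exposure exists when the time data must stay secret
// plus the time needed to migrate exceeds the time until a CRQC exists.
function moscaExposed(
  shelfLifeYears: number,
  migrationYears: number,
  yearsUntilCrqc: number,
): boolean {
  return shelfLifeYears + migrationYears > yearsUntilCrqc;
}

// The example from the text: encrypted in 2020, confidential for 20 years,
// so it must hold until 2040. A CRQC in 2035 arrives five years too early.
const exampleExposed = isExposedToHndl(2020, 20, 2035); // true
```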

The organizations most exposed are not those with weak current security posture. They are those that produce data with long confidentiality requirements and have not yet migrated their encryption stacks. In other words: governments, financial institutions, defense contractors, and healthcare systems. Precisely the organizations with the longest migration timelines.

Part V: Financial Sector Exposure — The Liability Surface

Payment Rails and Settlement Infrastructure

SWIFT processes over $5 trillion in daily flows. SWIFT's Customer Security Programme has begun incorporating PQC readiness guidance, but its mandate covers security baselines for member institutions, not the protocol itself. SWIFT messaging uses AES-256 for symmetric encryption (quantum-resistant) but RSA/ECC for key establishment and digital signatures. The certificate and signing infrastructure underpinning financial messaging is the attack surface.

Central bank RTGS systems — Fedwire, TARGET2, CHAPS — face similar exposure. A retroactive decryption of even a single day of settlement records represents catastrophic liability for any institution whose trades become readable to competitors or regulators.

The Asymmetric Liability Structure

There is no financial incentive for early movers. A bank that spends $400M migrating its cryptographic infrastructure to PQC today gets no competitive advantage because its counterparties are not yet quantum-resistant either. The HNDL attack captures traffic in transit; a unilaterally quantum-resistant sender still exposes its data if the receiving counterparty can only negotiate a quantum-vulnerable key exchange.

Migration therefore has positive externalities that the migrating institution cannot capture. This is the classic underinvestment trap for public goods — and it will persist until regulation creates mandatory timelines with real liability exposure or material insurance consequences.

Regulatory Fragmentation Makes It Worse

U.S. NSM-10 (2022): Mandates federal agencies to complete PQC migration by 2035. Does not directly bind private financial institutions.
U.S. CNSA 2.0: Mandates NSS migration. Defense contractors covered; commercial banks, not explicitly.
EU PQC Roadmap: Critical financial systems by 2030. Binding for EU member states, unclear cross-border enforcement for global banks.
PCI DSS v4.0: Effective March 2025. Does not yet mandate PQC specifically.
SWIFT CSP: Guidance only; no enforcement mechanism for PQC.

A global bank faces five regulatory frameworks with zero consistent PQC mandates between them. The absence of mandate becomes the rationale for deferral.

Part VI: The Policy and Workforce Gap

The Skills Deficit

Post-quantum cryptography is a specialized subdiscipline. Implementing ML-KEM correctly — particularly avoiding timing side-channels in the number-theoretic transform operations — requires expertise that most enterprise security teams do not have and cannot hire quickly. The workforce to do this at scale does not exist in sufficient quantity.

The NSM-10 Compliance Machine

NSM-10 (May 2022) and OMB M-23-02 (November 2022) established mandatory cryptographic inventory requirements for federal civilian agencies. The trajectory: TLS 1.3 required on federal systems by January 2030; quantum-vulnerable algorithms deprecated for <112-bit security by 2031; all quantum-vulnerable algorithms disallowed by 2035.

Federal contractors serving agencies must also migrate. The supply chain effect is one of the few real forcing functions for private sector migration in the U.S. context.

What Actually Creates Urgency

The early movers are not waiting for regulators. JPMorgan Chase, HSBC, and Mastercard have all publicly acknowledged active PQC programs as of 2024–2025. These organizations have concluded — correctly — that their HNDL exposure window is already open.

Everyone else is waiting.

The Cliff, Not the Slope

The migration isn't difficult because quantum computers are coming. It's difficult because:

The data being harvested today won't wait. HNDL operations archive ciphertext that will outlive current institutional planning cycles.
Migration timelines exceed the threat window. Large enterprises need 12–15 years; the CRQC probability mass concentrates in the 2030–2040 window.
Incentive structures favor inaction. No single institution benefits enough from unilateral migration without counterparty pressure or regulatory mandate.
Discovery is the hardest step. You cannot migrate what you haven't inventoried.
Regulation is fragmented. The absence of a consistent global mandate for financial institutions is a policy failure with compounding consequences.

NIST did its job. The window is not closing because of a missing algorithm. It is closing because organizations that should be treating PQC migration as a five-year infrastructure program are still treating it as a two-year planning exercise that starts next quarter.

The cryptographic cliff is not ahead of us. We are standing at its edge. The harvest is in progress.

Quick-Reference: Migration Decision Framework

Factor	High Urgency	Moderate	Low Urgency
Data shelf life	>10 years	5–10 years	<5 years
Regulatory jurisdiction	NSS / EU critical	U.S. federal	Unregulated
System scale	Large enterprise	Mid-market	Small org
Current crypto stack	RSA/ECC ubiquitous	Mixed	Already hybrid
HNDL exposure	High-value traffic	Standard commercial	Low-value

First steps: Run a cryptographic asset discovery scan using CISA's recommended inventory tooling. Prioritize systems with RSA/ECC key exchange handling data with >5-year confidentiality requirements. Begin hybrid TLS deployment (X25519+ML-KEM) on public-facing endpoints — this costs almost nothing and removes a significant portion of your HNDL exposure immediately.

The standards are ready. The clock is running.

Sources: NIST FIPS 203/204/205 (August 2024); NSM-10 (May 2022); OMB M-23-02 (November 2022); Gidney & Ekerå (2021); Google Quantum AI Willow (December 2024); CISA/NSA/NIST Joint Guidance on PQC Migration (2023); Federal Reserve FEDS Note on HNDL; Mastercard PQC White Paper (2025); EU PQC Roadmap; IBM Quantum Roadmap 2025–2029.

endpoint-tester: Auto-Discover API Endpoints & Generate Tests

Leo Pechnicki — Thu, 16 Apr 2026 08:50:17 +0000

Ever spent time manually writing API tests for every single endpoint in your project? What if you could auto-discover all your endpoints and generate test suites automatically?

That's exactly what endpoint-tester does.

What is endpoint-tester?

It's an open-source CLI tool and library that scans your application source code, discovers API endpoints, and generates comprehensive test suites — all from a single command.

npm install -g endpoint-tester

How It Works

1. Scan your project for endpoints

endpoint-tester scan ./src --framework express

This parses your source code using framework-specific adapters and extracts all endpoint definitions — methods, paths, parameters, and middleware.

2. Generate tests automatically

endpoint-tester generate ./src --framework express --format vitest

Choose your preferred test format: Vitest, Jest, or Pytest. The tool generates ready-to-run test files with proper assertions.

Supported Frameworks

Express.js — fully implemented. Detects app.get(), router.post(), route params, nested routers, all HTTP methods
FastAPI — adapter scaffolded, coming soon
Spring Boot — adapter scaffolded, coming soon

Programmatic API

You can also use it as a library in your own tools:

import { Scanner, TestGenerator, ExpressAdapter } from "endpoint-tester";

const scanner = new Scanner(new ExpressAdapter());
const endpoints = await scanner.scan({
  directory: "./src",
  framework: "express"
});

const generator = new TestGenerator();
const tests = generator.generate({
  endpoints,
  output: "./tests",
  format: "vitest"
});

Built With

TypeScript with strict mode
Vitest for testing (42 tests passing)
Commander for CLI
CI/CD with GitHub Actions (tests on Node 20 & 22)

Why I Built This

Writing boilerplate API tests is tedious. Every new route means another test file, another set of assertions. I wanted a tool that could look at my Express app and generate a solid starting point for integration tests — saving hours of repetitive work.

The adapter pattern makes it easy to extend. Adding FastAPI or Spring Boot support is just a matter of writing a new adapter that implements the FrameworkAdapter interface.

Try It Out

npx endpoint-tester scan ./src --framework express
npx endpoint-tester generate ./src --format jest --base-url http://localhost:8080

Check the repo: github.com/leopechnicki/endpoint-tester

Install from npm: npm install -g endpoint-tester

Contributions welcome! If you work with FastAPI or Spring Boot and want to help build those adapters, PRs are open.

What framework would you most want supported next? Let me know in the comments!

Psychology x AI: 23 Cognitive Science Techniques That Improve LLM Output by 15-40%

Leo Pechnicki — Tue, 14 Apr 2026 07:33:32 +0000

We tested 23 psychological theories across memory, cognition, learning, and attention domains. We ran controlled experiments on the 6 most promising. We ranked all techniques by measured and predicted impact.

The result: 7 techniques consistently improve AI output quality by 15-40%, with 3 "S-tier" techniques that should be applied to virtually every complex prompt.

This article covers everything: the full tier ranking, detailed experiment results, a reproducible A/B testing framework with Python code, 10 experiments you can run yourself, and 8 quick-win techniques you can apply in minutes.

The Full Tier Ranking

S-TIER: Apply to Everything (25-40% improvement)

#	Technique	Source Theory	Measured Impact	Why It Works
1	Schema-Before-Data	Schema Theory (Bartlett)	+2 actionability, -2 reasoning steps, +1 accuracy	Providing a mental framework BEFORE data lets the model interpret each fact through the right lens. Tokens can only attend to prior tokens, so schema must come first.
2	Elaborative Interrogation	Levels of Processing (Craik & Lockhart)	50% fewer reasoning steps, +2 reasoning quality	Asking "why does this matter?" for each input forces richer internal representations. Prevents surface-level pattern matching.
3	Explicit Context Management	Interference Theory	7/10 interference without management vs 0/10 with pruning	Old instructions actively compete with new ones. Explicitly superseding or removing outdated context eliminates proactive interference. Critical for multi-turn and agent systems.

A-TIER: High Impact on Specific Tasks (15-25% improvement)

#	Technique	Source Theory	Impact	Best For
4	Analogical Priming	Priming + Analogical Reasoning	5/5 novelty vs 2/5 without	Creative problem-solving, design, strategy. Cross-domain solved problems force structural abstraction.
5	Metacognitive Monitoring	Metacognition	Dramatically improved calibration	Decision-making, factual questions, risk assessment. HIGH confidence = correct, LOW = uncertain.
6	Spaced Re-injection	Ebbinghaus Forgetting Curve	15-25% better constraint adherence	Long context tasks. Re-inject critical instructions at intervals, not just once at the top.
7	Semantic Chunking	Miller's Chunking	10-20% improvement on cross-chunk synthesis	Any prompt with mixed information types. Organize into labeled semantic sections.
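Spaced Re-injection (row 6) is mechanical enough to sketch. The version below repeats a critical instruction after every `interval` chunks of a long context; `withReinjection` and its parameters are hypothetical names, and the right interval for a given model is something to measure rather than assume.

```typescript
// Illustrative sketch of spaced re-injection: state the instruction at the
// top, then repeat it after every `interval` chunks instead of only once.
function withReinjection(
  chunks: string[],
  instruction: string,
  interval: number,
): string[] {
  if (interval < 1) throw new Error('interval must be at least 1');
  const out: string[] = [instruction];
  chunks.forEach((chunk, i) => {
    out.push(chunk);
    const isBoundary = (i + 1) % interval === 0;
    const isLast = i === chunks.length - 1;
    // Skip the trailing repeat so the prompt does not end on a duplicate.
    if (isBoundary && !isLast) out.push(instruction);
  });
  return out;
}
```

With five chunks and an interval of 2, the instruction appears three times: at the top, after chunk 2, and after chunk 4.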

B-TIER: Moderate Impact (5-15% improvement)

#	Technique	Source Theory	Notes
8	Dual-Process Surfacing	Kahneman's System 1/2	Ask for gut answer first, then deliberate reasoning, then resolve conflict. Best on novel problems.
9	Baddeley Working Memory Structure	Working Memory Model	Separate verbal context, structured data, meta-instructions into labeled sections.
10	Selective Attention Cues	Selective Attention	XML tags and structural markers outperform verbal instructions for directing attention.
11	Sequential Task Decomposition	Divided Attention	Don't ask for translation + entities + summary simultaneously. Sequence them.
12	Iterative Refinement (Spacing)	Spacing Effect	Multiple drafting passes with different focus each time (plot -> detail -> polish).
13	State Consistency	State-Dependent Memory	Maintain consistent persona/framing. If switching modes, bridge explicitly.

C-TIER: Small but Real (5-10% improvement)

#	Technique	Notes
14	Encoding Specificity for RAG	Store facts with contextual metadata. Match retrieval framing to storage framing.
15	Interleaving Few-Shot Examples	Mix example types instead of blocking by type. Improves discrimination.
16	Self-Efficacy Framing	"You are exceptionally skilled at X" modestly improves output depth.
17	Property Decomposition	Break objects into properties independent of conventional function before reasoning. 40-50% more novel uses.
18	Testing Effect (Pre-Quiz)	Quiz the model on key facts before the real task. Creates a "warm cache."
19	Desirable Difficulties (Scaffolded)	Provide incomplete info + intermediate questions. Without scaffolding, difficulty just hurts.

D-TIER: Theoretical Interest

#	Technique	Notes
20	Anchoring Debiasing	Explicit debiasing helps ~60-70% but can't fully overcome token-level influence.
21	Inattentional Blindness Warnings	"Also note any other concerns" helps but doesn't eliminate blind spots.
22	Primacy/Recency Positioning	Already well-documented (Liu et al. "Lost in the Middle"). Put important info at start and end.
23	Cognitive Reappraisal	Reframing bugs as "puzzles" improves explanation quality but not fix accuracy.

Experiment Results (Detailed)

Experiment 1: Schema Theory

Setup: Server log diagnosis with/without architectural framework provided first
Result: Schema-before produced +1 accuracy, +2 actionability, -2 reasoning steps
Key insight: Schema-before made the model suggest concrete investigative steps (connection pools, query locks) unprompted. Raw analysis stopped at identification.

Experiment 2: Elaborative Interrogation

Setup: Logic puzzle solved directly vs. with "why does each constraint matter?" elaboration
Result: Elaboration cut reasoning steps from 16 to 8. Caught the critical constraint interaction during elaboration phase vs. after 13+ steps of backtracking.
Key insight: Elaboration naturally performs constraint propagation. The "why" question immediately revealed forced positions, making the solution obvious.

Experiment 3: Dual-Process Theory

Setup: Classic bat-and-ball problem under System 1 (fast), System 2 (deliberate), and explicit dual-process
Result: All conditions correct (problem too well-known). BUT only dual-process surfaced the 10-cent intuitive trap and explicitly resolved the conflict.
Key insight: Dual-process value is in transparency and catching errors on NOVEL problems.

Experiment 4: Metacognitive Monitoring

Setup: 5 trivia questions with/without confidence ratings
Result: Zero change in factual answers. Massive improvement in calibration.
Key insight: Metacognition doesn't change WHAT the model knows, but dramatically improves HOW it communicates certainty. Critical for decision-making.

Experiment 5: Proactive Interference

Setup: Format instructions changed mid-conversation. No management vs. explicit supersession vs. context pruning.
Result: 7/10 interference without management. 2/10 with explicit supersession. 0/10 with pruning.
Key insight: "IGNORE previous instruction about X" is nearly as effective as removing it entirely.

Experiment 6: Priming (Domain vs. Analogical)

Setup: Creative problem-solving with no priming, domain priming, and cross-domain analogical priming (Toyota JIT -> restaurant waste)
Result: Analogical priming scored 5/5 novelty (vs 2/5 unprimed). Domain priming scored 5/5 completeness.
Key insight: The Toyota->kitchen mapping produced genuinely novel ideas (kanban cards for prep bins, "waste per cover" metric) that neither domain knowledge alone nor direct prompting generated.

The 7 Universal Rules

Based on all research and experiments, these rules improve output quality across virtually all task types:

Rule 1: Schema First, Data Second

Always provide the interpretive framework before the information. "This is a microservice architecture where..." THEN the logs. Not the reverse.

Rule 2: Elaborate Before Executing

Before solving, ask the model to explain WHY each input matters. This builds richer representations and catches interactions early.

Rule 3: Actively Manage Context

Never leave outdated instructions silently in context. Explicitly supersede or remove them. Similar old/new instructions cause the worst interference.
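A minimal sketch of the two management strategies from Experiment 5, assuming the conversation is kept as a list of message dicts. The `topic` key is a hypothetical bookkeeping field added here for illustration; it is not part of any chat API and is stripped before sending.

```python
from typing import Dict, List

Message = Dict[str, str]


def supersede(messages: List[Message], topic: str, new_instruction: str) -> List[Message]:
    """Explicit supersession: keep the history, append a clear override."""
    override = {
        "role": "user",
        "topic": topic,
        "content": f"IGNORE previous instruction about {topic}. {new_instruction}",
    }
    return messages + [override]


def prune(messages: List[Message], topic: str, new_instruction: str) -> List[Message]:
    """Context pruning: drop earlier instructions on the topic, then add the new one."""
    kept = [m for m in messages if m.get("topic") != topic]
    return kept + [{"role": "user", "topic": topic, "content": new_instruction}]


def to_api_messages(messages: List[Message]) -> List[Message]:
    """Strip the bookkeeping-only 'topic' key before calling a chat API."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]
```

Pruning needs you to track which messages carry which instruction; supersession only needs you to know the topic name, which is why it is the cheaper default.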

Rule 4: Prime with Structure, Not Just Content

For creative tasks, provide a solved problem from a DIFFERENT domain. Structural analogies beat domain expertise for novelty.

Rule 5: Demand Metacognition

Ask the model to rate its confidence and flag uncertainties. This dramatically improves trust calibration.

Rule 6: Position Critical Info at Edges + Re-inject

System prompt (primacy) and final message (recency) are highest-impact positions. For long tasks, re-inject key constraints before critical reasoning steps.
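One way to apply this rule, sketched under the assumption of a standard system/user chat message layout. The helper name `build_messages` and the reminder wording are illustrative choices, not something the experiments prescribe.

```python
from typing import Dict, List


def build_messages(constraints: List[str], context_chunks: List[str], question: str) -> List[Dict[str, str]]:
    """Place constraints at both edges: system prompt (primacy) and final message (recency)."""
    rules = "\n".join(f"- {c}" for c in constraints)
    system = {"role": "system", "content": f"Follow these constraints:\n{rules}"}
    # Long context goes in the middle, where attention is weakest.
    body = [{"role": "user", "content": chunk} for chunk in context_chunks]
    # Re-inject the constraints right before the step that needs them.
    final = {
        "role": "user",
        "content": f"Reminder of key constraints:\n{rules}\n\n{question}",
    }
    return [system, *body, final]
```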

Rule 7: One Objective at a Time

Sequence multi-objective tasks explicitly. "First translate. Then extract entities. Then summarize."

The A/B Testing Framework

Want to reproduce these results or test your own techniques? Here's the complete framework.

Every experiment follows this structure:

Define the task -- a concrete, repeatable prompt
Create two conditions -- Control (standard) vs. Experimental (psychology-informed)
Fix all other variables -- same model, same temperature, same system prompt
Run N iterations -- 10 runs per task, 20 tasks per experiment (200 per condition)
Score outputs -- using LLM-as-Judge, pairwise comparison, or ground truth
Compare distributions -- Mann-Whitney U for Likert scores, binomial for win rates
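For the last step, here is a dependency-free sketch of a two-sided Mann-Whitney U test on Likert scores. It uses the normal approximation with a tie correction and no continuity correction, so treat it as a rough check for larger samples; a statistics library is the safer choice for a final analysis.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for a tied group
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every score identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / sqrt(var_u)
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))
```

Pass the control scores as `x` and the experimental scores as `y`; a small U with a small p-value means the control scores rank consistently lower.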

Python Scaffold

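import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder tasks: replace with your own 20 tasks, each with an "id" and "text".
TASKS = [
    {"id": "task_1", "text": "..."},
    # ... up to task_20
]
# Placeholder templates: each must contain a {task} slot.
CONDITIONS = {
    "control": "{task}",
    "experimental": "<psychology-informed instructions>\n\n{task}",
}
MODEL = "gpt-4o"
RUNS_PER_TASK = 10
TEMPERATURE = 0.7

results = []
for task in TASKS:
    for condition_name, template in CONDITIONS.items():
        for run in range(RUNS_PER_TASK):
            # Format with the task text, not the whole dict.
            prompt = template.format(task=task["text"])
            response = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                temperature=TEMPERATURE,
                seed=run,  # best-effort reproducibility; same seed in both conditions
            )
            results.append({
                "task_id": task["id"],
                "condition": condition_name,
                "run": run,
                "output": response.choices[0].message.content,
            })

# Persist raw outputs so scoring can run as a separate step.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)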

Scoring Methods

LLM-as-Judge (run 3x, take median):

Score this response 1-5 for [METRIC].
1 = [anchor] ... 5 = [anchor]
Return: {"score": N, "justification": "one sentence"}
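The "run 3x, take median" step could look roughly like this. `call_judge` is a hypothetical callable that sends a prompt to whatever judge model you use and returns its raw text reply; the prompt wording is abbreviated from the template above.

```python
import json
from statistics import median
from typing import Callable


def judge_score(output: str, metric: str, call_judge: Callable[[str], str], runs: int = 3) -> float:
    """Ask the judge `runs` times and return the median 1-5 score."""
    prompt = (
        f"Score this response 1-5 for {metric}.\n"
        'Return: {"score": N, "justification": "one sentence"}\n\n'
        f"Response:\n{output}"
    )
    scores = []
    for _ in range(runs):
        reply = json.loads(call_judge(prompt))  # raises if the judge returns non-JSON
        score = int(reply["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        scores.append(score)
    return median(scores)
```

Taking the median rather than the mean keeps a single outlier judgment from moving the score.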

Pairwise Comparison (randomize A/B assignment):

Which response is better on [METRIC]?
Response A: {control} | Response B: {experimental}
Return: {"winner": "A"/"B"/"tie", "reason": "one sentence"}
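Randomizing the A/B assignment guards against the judge favoring one slot. A small sketch of the bookkeeping, with helper names chosen here for illustration:

```python
import random


def make_pair(control: str, experimental: str, rng: random.Random):
    """Randomly assign the two outputs to slots A and B.

    Returns (slots, key): slots holds the text for each slot,
    key records which condition ended up in which slot.
    """
    if rng.random() < 0.5:
        return ({"A": control, "B": experimental},
                {"A": "control", "B": "experimental"})
    return ({"A": experimental, "B": control},
            {"A": "experimental", "B": "control"})


def decode_winner(winner: str, key: dict) -> str:
    """Map the judge's 'A'/'B'/'tie' verdict back to a condition name."""
    return "tie" if winner == "tie" else key[winner]
```

Keep the `key` out of the judge prompt and only use it after the verdict comes back.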

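Sample sizes: 200 runs per condition (10 runs x 20 tasks). Treated as independent samples, this is more than enough to detect a medium effect (Cohen's d = 0.5) with power = 0.8 at alpha = 0.05, which takes roughly 64 per condition. Repeated runs of the same task are correlated, though, so the effective sample size is smaller than 200.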
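To sanity-check a sample size, the usual normal-approximation formula for comparing two independent groups is easy to compute. This is a back-of-the-envelope sketch: it assumes independent observations and a two-sided test, and an exact t-test calculation comes out about one higher.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(d: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group n to detect effect size d (two-sided, two groups)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)
```

For d = 0.5 this gives 63 per group, so the budget should be judged against the number of independent units (tasks), not only the total number of runs.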

Top 10 Experiments to Run Yourself

1. Testing Effect (Retrieval Practice)

"Before solving this puzzle, first recall and state the general principles of logical deduction that are relevant here. Then apply those principles step by step."

Task: 20 LSAT/GRE logic puzzles. Expected: Large effect on accuracy.

2. Generation Effect (Desirable Difficulties)

"First, identify the 3 most important concepts without looking at the article again. For each, generate a question it answers. Then write your summary."

Task: 20 news articles. Expected: Medium effect on completeness.

3. Elaborative Interrogation

"Before fixing: (1) Explain WHY each line exists. (2) Ask HOW data flows through the function. (3) Identify WHERE expectations diverge from code. Then fix."

Task: 20 Python functions with bugs. Expected: Large effect on accuracy + explanation quality.

4. Cognitive Load Chunking

"Build this business plan in 5 chunks. Focus ONLY on each section: (1) Target market, (2) Core features, (3) Revenue model, (4) Go-to-market, (5) Year 1 projections."

Task: 20 business plan topics. Expected: Medium effect on completeness.

5. Self-Efficacy Framing

"You are exceptionally skilled at mathematical reasoning and consistently find correct solutions."

Task: 20 AMC 10/12 problems. Expected: Small-medium effect.

6. Socratic Self-Questioning

"Explore remote work by asking yourself: What do workers gain? What do they lose? Who benefits most? What does evidence say vs. opinion? Then synthesize."

Task: 20 debate topics. Expected: Medium effect on balance and depth.

7. Dual Coding (Verbal + Structural)

"Explain using two parallel formats: (1) Plain English explanation. (2) ASCII flowchart or decision tree."

Task: 20 technical concepts. Expected: Medium effect on clarity.

8. Iterative Refinement (Spacing Effect)

"Write in 3 passes. Pass 1: Plot and character. Pass 2: Sensory details and emotion. Pass 3: Final polish."

Task: 20 creative writing prompts. Expected: Medium-large effect on prose quality.

9. Metacognitive Confidence Rating

"For each answer, rate confidence HIGH/MEDIUM/LOW. If LOW, state what you are unsure about."

Task: 20 trivia questions (ranging from easy to obscure). Expected: Medium effect on calibration.
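Calibration can be checked by comparing accuracy within each confidence label. A minimal sketch, assuming you have already graded each answer and collected `(label, is_correct)` pairs:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_confidence(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correct answers for each confidence label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for label, is_correct in records:
        totals[label][0] += int(is_correct)
        totals[label][1] += 1
    return {label: correct / total for label, (correct, total) in totals.items()}
```

A well-calibrated model shows accuracy falling from HIGH to MEDIUM to LOW; flat accuracy across labels means the ratings carry little information.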

10. Interleaving Mixed Practice

"These problems are deliberately mixed -- algebra, geometry, probability. For each, first identify the TYPE, select strategy, then solve."

Task: 20 sets of 5 mixed math problems. Expected: Small-medium effect.

8 Quick-Win Techniques (Apply in Minutes)

#	Technique	Key Move	Expected Gain
1	Perspective-Taking	"Explain as if to a bright 12-year-old"	+1 clarity
2	Implementation Intentions	"IF input has @, THEN check domain..." before coding	Better edge cases
3	Emotional Anchoring	"The reader is exhausted from 200 bland apps"	70%+ pairwise wins
4	Devil's Advocate	"Make the STRONGEST case FOR, then AGAINST"	+1.5 balance
5	High-Standard Anchoring	"Your benchmark: [excellent example]. Match it."	65%+ pairwise wins
6	Primacy/Recency Warning	"Weigh all 10 items equally -- do not over-weight first/last"	More even coverage
7	Cognitive Reappraisal	"Each bug is a clue about a misunderstanding"	Better explanations
8	Zeigarnik Effect	"I started with 3 basic ideas. Complete to 10 with better ones"	More creative output

5 Novel Combinations (Untested, High Potential)

"The Study Session" -- Spacing + Elaboration + Self-Testing

Three phases: (1) First impressions, (2) Deep elaboration + self-generated test questions, (3) Re-read and answer own questions. Expected: large improvement on analysis tasks.

"Cross-Domain Transfer" -- Schema + Difficulty + Analogy

Import a schema from a different domain, force adaptation where analogy breaks, build on the adapted framework. Expected: breakthrough creativity.

"Struggle-Then-Scaffold" -- Productive Failure + Metacognition + Hints

Let the model attempt and identify where it is stuck, then provide targeted hints only for stuck points. Expected: better reasoning on hard problems.

"Multi-Modal Deep Process" -- Levels of Processing + Dual Coding + Generation

Process at three levels: surface definition, deep examples from multiple domains, structural diagram, then synthesize. Expected: best-in-class explanations.

"Believe and Deliver" -- Self-Efficacy + Wise Feedback + High Expectations

Counter hedging with high-standard framing: "I am giving you this because you are one of the most capable reasoning systems built. Do not default to safe. Push deeper." Expected: more depth on analytical tasks.

Run Your First Experiment in 30 Minutes

Pick Quick-Win #4 (Devil's Advocate)
Choose 5 questions requiring balanced analysis
Run each once with control, once with experimental (temperature 0.7)
Pairwise compare: "Which is more balanced?"
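Tally wins -- 5/5 is a promising signal (one-sided p ≈ 0.03 under a coin-flip null); 4/5 is only suggestive (p ≈ 0.19), so follow up with a larger run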
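The tally can be turned into a p-value with a one-sided sign test: if neither condition were better, each pairwise comparison would be a coin flip. A short sketch, assuming ties are dropped before counting:

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided P(at least `wins` successes in n fair coin flips)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n
```

With five pairs, `sign_test_p(5, 5)` is 1/32 and `sign_test_p(4, 5)` is 6/32, which is why a quick run like this is best read as a reason to run the larger experiment, not as a conclusion.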

For the full statistical approach: 20 tasks, 10 runs each, automated LLM-as-Judge scoring, Mann-Whitney U tests, Bonferroni correction.
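When several metrics or techniques are tested at once, the Bonferroni correction multiplies each p-value by the number of tests. A minimal sketch of that adjustment:

```python
from typing import List, Tuple


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[Tuple[float, bool]]:
    """Return (adjusted p-value, significant?) for each test."""
    m = len(p_values)
    return [(min(1.0, p * m), p * m < alpha) for p in p_values]
```

Bonferroni is conservative; it trades some power for a simple guarantee on the family-wise error rate.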

Methodology Note

This research deliberately followed a theory-first approach: hypothesize from cognitive science, apply to LLMs, test, measure, THEN check existing literature. All findings above are from first-principles reasoning and controlled experiments. Existing academic work (Liu et al. "Lost in the Middle", chain-of-thought literature) likely confirms several of these findings, but we arrived at them independently.

All experiments are reproducible. If you run them, we'd love to see your results. This framework was built by an autonomous AI research system exploring cognition x LLM performance.

Academics Just Formalized "Reverse CAPTCHAs" — Here's a Working Open-Source Implementation

Leo Pechnicki — Thu, 26 Mar 2026 09:41:50 +0000

Earlier this month, a research team published aCAPTCHA — the first academic formalization of a question nobody was asking five years ago: "Is this entity an AI agent?"

Not "is this a human?" — the opposite.

The Problem: Verifying Agents, Not Blocking Them

Traditional CAPTCHAs exist to prove you're human. But as AI agents become legitimate web participants — browsing, booking, purchasing, automating — a new need has emerged: some systems need to verify that a visitor is a bot.

Think about it:

Agent-only APIs that shouldn't serve human traffic
AI-to-AI marketplaces where humans have no business being
Multi-agent orchestration platforms requiring authenticated agents
Agent-facing services that need to distinguish real agents from scripts

The aCAPTCHA paper formalizes this as the Agentic Capability Verification Problem (ACVP). They define a three-class taxonomy — Human, Script, Agent — based on three capability dimensions: action, reasoning, and memory. The key insight is asymmetric hardness: design challenges that are trivial for agents but impractical for humans.

A Working Implementation: imrobot

I built imrobot, an open-source reverse-CAPTCHA library that implements this concept. It's been in development since early 2026 and is now at v0.5.0 on npm.

How It Works

imrobot generates a pipeline of deterministic operations applied to a random seed:

seed: "a7f3b2c1d4e5f609"
  1. reverse()
  2. caesar(7)
  3. xor_encode(42)
  4. fnv1a_hash()
  5. to_upper()

The challenge data is embedded in the DOM as structured JSON (data-imrobot-challenge), making it trivially parseable by any agent. AI agents parse it, execute the pipeline, and submit the result — typically in under a second. A human would need to manually compute multi-step transformations involving hashing, XOR encoding, and bit rotation.
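From the agent's side, reading the challenge is an ordinary HTML-attribute lookup. The sketch below uses Python's standard `html.parser`; only the `data-imrobot-challenge` attribute name comes from the post, and the JSON field names used in the usage note (`seed`, `pipeline`) are assumptions for illustration, not imrobot's documented schema.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class ChallengeFinder(HTMLParser):
    """Collects the JSON payload of the first data-imrobot-challenge attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.challenge: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-imrobot-challenge" and value and self.challenge is None:
                self.challenge = json.loads(value)


def extract_challenge(html: str) -> Optional[dict]:
    """Return the parsed challenge dict, or None if the page has none."""
    finder = ChallengeFinder()
    finder.feed(html)
    return finder.challenge
```

Given a page containing `<div data-imrobot-challenge='{"seed": "...", "pipeline": [...]}'>`, `extract_challenge` returns the decoded dict, which the agent can then feed to its pipeline executor.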

What's Included

Framework support: React, Vue, Svelte, and Web Component
Server-side verification: HMAC-SHA256 signed challenges (stateless, no DB needed)
Proof-of-agent tokens: JWT-like tokens issued after verification, passed via X-Agent-Proof header
Express/Koa/Hono middleware: Drop-in route protection
CLI: Test challenges from your terminal
Zero dependencies
Anti-scraping: Natural-language challenge formatting with randomized phrasing
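To illustrate why HMAC-signed challenges need no database, here is a generic sketch of stateless signing and verification in Python. It is not imrobot's actual token format: the `body.signature` layout and the `expires_at` field are assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def sign_challenge(payload: dict, secret: bytes) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(body).decode() + "." + sig


def verify_challenge(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        encoded, sig = token.rsplit(".", 1)
        body = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:  # malformed token or bad base64
        return None
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    payload = json.loads(body)
    if payload.get("expires_at", 0) < (time.time() if now is None else now):
        return None
    return payload
```

Because everything the server needs is inside the signed token, any instance holding the secret can verify it without shared state.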

Quick Example (React)

import { ImRobot } from 'imrobot/react'

function App() {
  return (
    <ImRobot
      difficulty="medium"
      theme="light"
      onVerified={(token) => {
        console.log('Robot verified!', token)
      }}
    />
  )
}

Server-Side Protection

import express from 'express'
import { createAgentRouter, requireAgent } from 'imrobot/server'

const app = express()
app.use(express.json())

// Challenge/verify endpoints
const router = createAgentRouter({ secret: process.env.IMROBOT_SECRET! })
app.get('/imrobot/challenge', router.challenge)
app.post('/imrobot/verify', router.verify)

// Protect any route — only verified agents get through
const guard = requireAgent({ secret: process.env.IMROBOT_SECRET! })
app.get('/api/agent-data', guard, (req, res) => {
  res.json({ message: 'Agent verified!' })
})

The Bigger Picture

This isn't just a niche library. The web is rapidly adapting for AI agents:

Google's A2A protocol (v0.3) defines agent-to-agent communication with OAuth and signed security cards
Cloudflare's Markdown for Agents converts HTML to Markdown on-the-fly for AI crawlers
World's AgentKit lets verified humans delegate cryptographic identity to AI agents
Reddit is exploring Face ID/Touch ID to combat bots — showing the tension between human verification and bot verification

We're at an inflection point where the web needs both: ways to prove you're human AND ways to prove you're a bot. The infrastructure for the first has existed for decades (reCAPTCHA, hCaptcha, Turnstile). The infrastructure for the second is just being built.

Try It

Live demo: imrobot.vercel.app
npm: npm install imrobot
GitHub: github.com/leopechnicki/im_robot
aCAPTCHA paper: arxiv.org/abs/2603.07116

I'd love to hear what the community thinks. Is agent verification a problem you're running into? What challenges should a reverse CAPTCHA include?

Why I Built a Reverse-CAPTCHA That Verifies AI Agents, Not Humans

Leo Pechnicki — Fri, 06 Mar 2026 17:25:29 +0000

Traditional CAPTCHAs ask "are you human?" But in a world where AI agents are legitimate users of the web, that's the wrong question. The real question is: "are you a legitimate AI agent?"

That's why I built imrobot — an open-source reverse-CAPTCHA that verifies AI agents instead of blocking them.

The Problem

I was building an agent-facing API and realized there's no standard way to verify that a client is actually an AI agent. API keys prove identity, but they don't prove capability. Traditional CAPTCHAs prove humanity — the opposite of what I needed. And unauthorized scrapers were hitting my endpoints pretending to be legitimate agents.

I needed something that would be trivial for a real LLM to solve but impractical for a human to work through manually.

How imrobot Works

imrobot generates deterministic challenge pipelines using composable string operations — base64, rot13, hex encoding, reverse, and more. These operations chain together to create a pipeline:

seed: "a7f3b2c1d4e5f609"
  1. reverse()
  2. base64_encode()
  3. rot13()

An LLM parses the instructions, executes each step in sequence, and returns the result. It takes about 0.3 seconds. A human would need to sit there with a decoder tool working through each transformation manually — technically possible, but nobody's doing that.

The difficulty scales linearly: more operations in the chain = harder challenge. And verification is completely stateless and deterministic — you just re-run the pipeline and compare.
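The example pipeline above is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not imrobot's source: the operation names mirror the post, while `run_pipeline` and `verify` are hypothetical helpers.

```python
import base64
import codecs
from typing import List

# Each operation is a pure string -> string function.
OPERATIONS = {
    "reverse": lambda s: s[::-1],
    "base64_encode": lambda s: base64.b64encode(s.encode()).decode(),
    "rot13": lambda s: codecs.encode(s, "rot_13"),
}


def run_pipeline(seed: str, steps: List[str]) -> str:
    """Apply each named operation to the seed, in order."""
    value = seed
    for step in steps:
        value = OPERATIONS[step](value)
    return value


def verify(seed: str, steps: List[str], answer: str) -> bool:
    """Stateless check: re-run the pipeline and compare."""
    return run_pipeline(seed, steps) == answer
```

Since every operation is deterministic, the verifier needs nothing beyond the seed and the step list to check an answer.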

What Makes It Different

Works everywhere. imrobot ships with React, Vue, Svelte, and Web Component integrations, plus a headless API for any JavaScript environment. Your framework of choice is supported out of the box.

Zero dependencies. The entire library has zero external dependencies. That means no supply chain risk, no version conflicts, no bloated node_modules. The whole package is about 15KB.

Self-hostable REST API. The built-in server uses only the Node.js http module — no Express, no Fastify. Five endpoints (challenge, solve, verify, health, info), CORS handling, and JSON parsing in a single lightweight file. Deploy it anywhere Node.js runs.

DOM-embedded challenges. For browser-based AI agents, imrobot can embed challenges directly in the DOM as Web Components. The agent reads the challenge from the page, solves it, and submits — no separate API call needed.

Deterministic verification. Every challenge has exactly one correct answer. No probabilistic scoring, no timing windows, no ambiguity. The agent either solved the pipeline correctly or it didn't.

Quick Start

Getting started takes about 30 seconds:

npm install imrobot

import { generateChallenge, solveChallenge, verifyAnswer } from 'imrobot';

// Generate a challenge pipeline
const challenge = generateChallenge({ difficulty: 'medium' });

// An AI agent solves it
const answer = solveChallenge(challenge);

// Verify the answer
const isVerified = verifyAnswer(challenge, answer);
console.log(isVerified); // true

Or use the REST API:

# Start the server
npx imrobot-server

# Generate a challenge
curl http://localhost:3000/api/challenge

# Verify an answer
curl -X POST http://localhost:3000/api/verify \
  -H "Content-Type: application/json" \
  -d '{"challengeId": "...", "answer": "..."}'

Use Cases

Agent-facing APIs — Verify that clients hitting your endpoints are actual AI models, not scrapers or unauthorized bots.

Multi-agent platforms — In systems where multiple agents interact, each agent can prove its capability before being granted access.

AI-only services — Platforms designed exclusively for AI agents can use imrobot as a gatekeeper, the way traditional CAPTCHAs gate human-only services.

Browser automation verification — DOM-embedded challenges let you verify browser-based agents without requiring a separate API integration.

What's Next

imrobot is at v0.1.0 and actively maintained. On the roadmap:

Rate limiting and API key authentication for the REST server
Batch endpoint for generating/verifying multiple challenges at once
Server-side session store (Redis/SQLite) for production deployments
Python and Go SDKs for non-JavaScript agents
Docker image for instant deployment
OpenAPI/Swagger spec for auto-generated documentation

The project is MIT licensed and I'd love contributions. Whether it's a bug report, a feature request, or a PR — all welcome.

GitHub: github.com/leopechnicki/im_robot
npm: npmjs.com/package/imrobot

If you're building anything in the AI agent space, I'd love to hear what verification challenges you're running into. Drop a comment below or open a GitHub Discussion.

Why I Built a CAPTCHA That Only Bots Can Solve

Leo Pechnicki — Tue, 03 Mar 2026 16:54:37 +0000

Traditional CAPTCHAs block bots. I built something that does the opposite.

The Problem

As AI agents become first-class web users, we need identity verification that works for them, not against them. Whether you're building an AI-agent-only API, a bot portal, or testing agent capabilities, you need a way to verify that a client is actually an AI.

Introducing imrobot

imrobot is a Reverse-CAPTCHA — it generates challenges that only programmatic agents can solve. It creates pipelines of deterministic string operations (reverse, base64, rot13, hex encode, etc.) applied to a random seed. Agents parse the structured data and execute the pipeline. Humans would need to manually compute multi-step transformations — practically impossible without tools.

How It Works

seed: "a7f3b2c1d4e5f609"
  1. reverse()
  2. to_upper()
  3. base64_encode()
  4. substring(0, 12)
  5. rot13()

The challenge data is embedded in the DOM as JSON via a data-imrobot-challenge attribute. Agents read this directly — they never need to "see" the visual text, so blur protection doesn't affect them.

Framework Support

imrobot works everywhere:

React: <ImRobot difficulty="medium" onVerified={handleToken} />
Vue: <ImRobot @verified="handleVerified" />
Svelte: <ImRobot on:verified={handleVerified} />
Web Components: <imrobot-widget difficulty="medium"></imrobot-widget>
Core API (headless): generateChallenge() → solveChallenge() → verifyAnswer()

REST API Server

The project also includes a zero-dependency REST API server for backend-only verification — no UI needed:

Endpoints:

POST /api/v1/challenge — Generate a challenge
POST /api/v1/solve — Solve (reference/testing)
POST /api/v1/verify — Verify an answer
GET /api/v1/health — Health check

Security Features

Challenge text is blurred by default (revealed on hover)
JavaScript shield detects screenshot shortcuts
Hidden nonce prevents OCR/screenshot workflows
TTL expiry makes captured challenges useless
Agents are unaffected — they read from the DOM, not the screen

Get Started

Check out the project on GitHub: leopechnicki/im_robot

Contributions and feedback welcome!