<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aviad Rozenhek</title>
    <description>The latest articles on DEV Community by Aviad Rozenhek (@aviadr1).</description>
    <link>https://dev.to/aviadr1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594498%2Ff47a592b-c7ff-483e-ad05-8a64bbc23d08.jpg</url>
      <title>DEV Community: Aviad Rozenhek</title>
      <link>https://dev.to/aviadr1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aviadr1"/>
    <language>en</language>
    <item>
      <title>Turning Linear and Figma Into a Searchable Knowledge Layer with `aviadr1/issueclaw` and `aviadr1/figmaclaw`</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Fri, 03 Apr 2026 12:13:47 +0000</pubDate>
      <link>https://dev.to/aviadr1/turning-linear-and-figma-into-a-searchable-knowledge-layer-with-aviadr1issueclaw-and-29c5</link>
      <guid>https://dev.to/aviadr1/turning-linear-and-figma-into-a-searchable-knowledge-layer-with-aviadr1issueclaw-and-29c5</guid>
      <description>&lt;p&gt;Most teams don’t lack information — they lack a way to search product, design, and engineering context as one system; this article explores what happens when you turn Linear and Figma into searchable markdown and let humans and AI reason across both.&lt;/p&gt;

&lt;p&gt;Product intent is in Linear. Design detail is in Figma. Implementation reality is in code. Each system is useful on its own, but the moment you need to answer a question that crosses all three, things get awkward. You open tabs, search manually, ask in Slack, and reconstruct the answer from fragments.&lt;/p&gt;

&lt;p&gt;That is tolerable for humans. It is much worse for AI.&lt;/p&gt;

&lt;p&gt;At Gigaverse, we wanted something different: not a new source of truth, and not a replacement for Linear or Figma, but a &lt;strong&gt;shared text layer&lt;/strong&gt; that makes product and design context searchable, diffable, and usable by both humans and agents.&lt;/p&gt;

&lt;p&gt;That is the role of two repositories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aviadr1/issueclaw&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aviadr1/figmaclaw&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they turn Linear and Figma into git-native markdown that can be searched alongside code. Once that exists, a different class of workflow becomes possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This article is about a simple idea with surprisingly large consequences:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mirror Linear and Figma into markdown, keep that mirror up to date, and use it as a searchable working memory layer for your team.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you read on, you should come away with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clear mental model for what &lt;code&gt;issueclaw&lt;/code&gt; and &lt;code&gt;figmaclaw&lt;/code&gt; do&lt;/li&gt;
&lt;li&gt;a practical sense of why this is useful beyond “exporting markdown”&lt;/li&gt;
&lt;li&gt;real examples of the kinds of questions this unlocks&lt;/li&gt;
&lt;li&gt;a private-by-default way to adopt the same pattern in your own team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The short version is that this makes it much easier to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where does this feature already exist in design?&lt;/li&gt;
&lt;li&gt;What did product actually decide here?&lt;/li&gt;
&lt;li&gt;Is this designed on both web and mobile?&lt;/li&gt;
&lt;li&gt;Does the design still match the current implementation?&lt;/li&gt;
&lt;li&gt;What changed in one system that has not propagated to the others?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where the real value starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: product knowledge is fragmented by tool boundaries
&lt;/h2&gt;

&lt;p&gt;Modern product development is full of cross-system questions.&lt;/p&gt;

&lt;p&gt;Not “what does this ticket say?”&lt;br&gt;
Not “what does this Figma page look like?”&lt;br&gt;
But questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the current shape of this feature from spec to design to implementation?&lt;/li&gt;
&lt;li&gt;Which design patterns are duplicated or inconsistent?&lt;/li&gt;
&lt;li&gt;Does this bug contradict the spec, the design, or the code?&lt;/li&gt;
&lt;li&gt;Which features have been explored deeply in design but barely implemented?&lt;/li&gt;
&lt;li&gt;Where are we at risk of building from stale assumptions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not rare edge cases. They are everyday questions.&lt;/p&gt;

&lt;p&gt;And most tools are not built to answer them well.&lt;/p&gt;

&lt;p&gt;Linear is good at tickets, docs, and status.&lt;br&gt;
Figma is good at visual design and flows.&lt;br&gt;
Git is good at code and history.&lt;/p&gt;

&lt;p&gt;The problem is everything that lives between them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The core idea
&lt;/h2&gt;

&lt;p&gt;The idea behind &lt;code&gt;aviadr1/issueclaw&lt;/code&gt; and &lt;code&gt;aviadr1/figmaclaw&lt;/code&gt; is straightforward.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;issueclaw&lt;/code&gt; mirrors the Linear side into markdown&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;figmaclaw&lt;/code&gt; mirrors the Figma side into markdown&lt;/li&gt;
&lt;li&gt;that markdown lives in git, alongside the rest of your technical context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you do that, you get something new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear stops being only a UI you browse&lt;/li&gt;
&lt;li&gt;Figma stops being only a canvas you click through&lt;/li&gt;
&lt;li&gt;git becomes a searchable memory layer for product, design, and engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the key shift.&lt;/p&gt;

&lt;p&gt;This is not about replacing the original tools. The source systems remain the source of truth. The markdown mirror exists to make them more searchable, more comparable, and more usable in workflows that span multiple systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  What &lt;code&gt;aviadr1/figmaclaw&lt;/code&gt; does
&lt;/h2&gt;

&lt;p&gt;At a high level, &lt;code&gt;figmaclaw&lt;/code&gt; turns Figma pages into markdown.&lt;/p&gt;

&lt;p&gt;Each page becomes a &lt;code&gt;.md&lt;/code&gt; file with two layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a structural layer with things like page IDs, frame IDs, flow information, and metadata&lt;/li&gt;
&lt;li&gt;a readable layer with summaries, sections, descriptions, and flow representations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives you something that is simultaneously machine-friendly and human-friendly.&lt;/p&gt;

&lt;p&gt;Machines can diff it, hash it, update it incrementally, and search it cheaply.&lt;/p&gt;

&lt;p&gt;Humans can read it, grep it, review it, and paste it into AI workflows without having to browse Figma frame by frame.&lt;/p&gt;

&lt;p&gt;In the Gigaverse setup, this created a large text mirror of the design workspace: hundreds of markdown files representing pages across dozens of Figma files. That was enough to turn design discovery into a search problem instead of a browsing problem.&lt;/p&gt;
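&lt;p&gt;To make the two-layer idea concrete, here is a minimal sketch of what rendering one page could look like. This is illustrative only: the field names, the frontmatter layout, and the &lt;code&gt;render_page_md&lt;/code&gt; helper are assumptions, not &lt;code&gt;figmaclaw&lt;/code&gt;’s actual output format.&lt;/p&gt;

```python
# Illustrative sketch only: the real figmaclaw format may differ.
# It shows the two-layer idea: machine-diffable metadata on top,
# human-readable prose below.

def render_page_md(page):
    """Render one Figma page as markdown with a structural and a readable layer."""
    lines = []
    # Structural layer: stable IDs and metadata that tools can diff and hash.
    lines.append("---")
    lines.append(f"page_id: {page['id']}")
    lines.append(f"file_key: {page['file_key']}")
    lines.append("frames:")
    for frame in page["frames"]:
        lines.append(f"  - id: {frame['id']}")
        lines.append(f"    name: {frame['name']}")
    lines.append("---")
    # Readable layer: prose that humans (and LLMs) can search and skim.
    lines.append(f"# {page['name']}")
    lines.append("")
    lines.append(page.get("summary", "(not yet enriched)"))
    return "\n".join(lines)

page = {
    "id": "12:34",
    "file_key": "abc123",
    "name": "Onboarding / Mobile",
    "frames": [{"id": "12:40", "name": "Welcome screen"}],
    "summary": "Three-step onboarding flow with a welcome screen and permissions prompt.",
}
print(render_page_md(page))
```

&lt;p&gt;The point of the split is that the top block stays stable and diffable while the prose below it can be regenerated freely.&lt;/p&gt;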


&lt;h2&gt;
  
  
  What &lt;code&gt;aviadr1/issueclaw&lt;/code&gt; does
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;issueclaw&lt;/code&gt; does the complementary job on the Linear side.&lt;/p&gt;

&lt;p&gt;It mirrors issues, docs, and related artifacts into markdown so they can live in git as searchable text. That makes product context easier to diff, easier to inspect historically, and far easier to combine with design and code context.&lt;/p&gt;

&lt;p&gt;The important point is not just that the data is exported.&lt;/p&gt;

&lt;p&gt;It is that once the product layer and the design layer are both present as text, they can be searched together.&lt;/p&gt;

&lt;p&gt;That is what unlocks the more interesting workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this is more useful than it sounds
&lt;/h2&gt;

&lt;p&gt;If you hear “we exported Figma and Linear into markdown,” it sounds mildly convenient.&lt;/p&gt;

&lt;p&gt;That undersells it.&lt;/p&gt;

&lt;p&gt;The real benefit is that you get a &lt;strong&gt;new reasoning layer&lt;/strong&gt; across systems that normally do not compose well.&lt;/p&gt;

&lt;p&gt;At Gigaverse, this led to a few capabilities that stood out immediately.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Full-text search across designs
&lt;/h3&gt;

&lt;p&gt;This was the clearest immediate win.&lt;/p&gt;

&lt;p&gt;Once Figma was mirrored into markdown, questions like these became trivial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where do polls appear across the product?&lt;/li&gt;
&lt;li&gt;Which screens mention notifications?&lt;/li&gt;
&lt;li&gt;How much dark mode coverage do we have?&lt;/li&gt;
&lt;li&gt;What onboarding UI already exists?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are annoying questions in raw Figma. They become easy when design is searchable as text.&lt;/p&gt;
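&lt;p&gt;In practice, that kind of search needs nothing fancier than a recursive scan of the mirror. A minimal sketch, assuming the design mirror lives in a local &lt;code&gt;figma/&lt;/code&gt; directory (the path and layout are illustrative):&lt;/p&gt;

```python
# Minimal sketch: case-insensitive full-text search over a markdown mirror.
# The "figma" directory name is an assumption; point it at your own mirror.
from pathlib import Path

def search_mirror(root, term):
    """Return (file, line_number, line) hits for a term across .md files."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if term.lower() in line.lower():
                hits.append((str(path), lineno, line.strip()))
    return hits

# Example: "Where do polls appear across the product?"
for path, lineno, line in search_mirror("figma", "poll"):
    print(f"{path}:{lineno}: {line}")
```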

&lt;p&gt;This alone is a strong enough use case to justify the mirror.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Feature tracing from spec to design to implementation
&lt;/h3&gt;

&lt;p&gt;One of the strongest tests was tracing Gigaverse’s “going live” feature end to end.&lt;/p&gt;

&lt;p&gt;By combining Linear docs, Figma markdown, and implementation issues, we could reconstruct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the intended product behavior&lt;/li&gt;
&lt;li&gt;the relevant web and mobile flows&lt;/li&gt;
&lt;li&gt;the implementation work across teams&lt;/li&gt;
&lt;li&gt;and contradictions between different sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not just convenient retrieval. That is actual cross-system understanding.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Reconstructing design decisions
&lt;/h3&gt;

&lt;p&gt;Another strong use case was looking at stage widgets.&lt;/p&gt;

&lt;p&gt;The Figma side showed iterations, comparisons, and changing layout ideas. The Linear side showed what had been decided and how implementation tracked those decisions.&lt;/p&gt;

&lt;p&gt;That let us answer not just “what screens exist?” but “what was explored, what direction won, and how far did it get?”&lt;/p&gt;

&lt;p&gt;That is a much more interesting kind of query.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Platform parity analysis
&lt;/h3&gt;

&lt;p&gt;This turned out to be a very practical question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which features exist on both mobile and web?&lt;/li&gt;
&lt;li&gt;Which are intentionally platform-specific?&lt;/li&gt;
&lt;li&gt;Where are the gaps?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That kind of audit becomes much easier once the design layer and the issue layer are searchable in one place.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Design system audits
&lt;/h3&gt;

&lt;p&gt;This is another category that gets dramatically better once design becomes text.&lt;/p&gt;

&lt;p&gt;Instead of clicking through files manually, you can search for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;empty states&lt;/li&gt;
&lt;li&gt;loading states&lt;/li&gt;
&lt;li&gt;modal patterns&lt;/li&gt;
&lt;li&gt;permission flows&lt;/li&gt;
&lt;li&gt;error states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes inconsistency much easier to spot.&lt;/p&gt;


&lt;h2&gt;
  
  
  The most interesting capability: drift detection
&lt;/h2&gt;

&lt;p&gt;The biggest surprise was that the highest-value use case was not search.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;drift detection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once specs, issues, and designs were all searchable together, it became much easier to spot cases where they disagreed.&lt;/p&gt;

&lt;p&gt;At Gigaverse, this surfaced real examples of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;contradictory toggle defaults across issues, product docs, and QA expectations&lt;/li&gt;
&lt;li&gt;terminology drift between active ticket language and newer design language&lt;/li&gt;
&lt;li&gt;design elements that remained in Figma after they had effectively been removed from the product direction&lt;/li&gt;
&lt;li&gt;exploratory design states with no matching implementation work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the system stopped feeling like a retrieval convenience and started feeling like infrastructure.&lt;/p&gt;

&lt;p&gt;Search saves time.&lt;/p&gt;

&lt;p&gt;Drift detection saves teams from bugs, confusion, and rework.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why not just use Figma MCP directly?
&lt;/h2&gt;

&lt;p&gt;Because the two approaches are good at different things.&lt;/p&gt;

&lt;p&gt;Figma MCP is best when you already know what you want to inspect. It is excellent for exact visual truth, precise frame inspection, and prototype details.&lt;/p&gt;

&lt;p&gt;The markdown mirror is best when you do not yet know where to look.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;For broad search, inventory, audits, and AI context-building, markdown is dramatically more efficient. It is faster to search, cheaper to process, and much more compact for LLM workflows than raw API payloads or screenshot-heavy browsing.&lt;/p&gt;

&lt;p&gt;The best workflow is not one or the other.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the markdown layer to narrow the search space.&lt;br&gt;
Use Figma MCP only for the specific frames where visual precision matters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That division of labor turned out to be extremely effective.&lt;/p&gt;


&lt;h2&gt;
  
  
  How &lt;code&gt;figmaclaw&lt;/code&gt; stays current
&lt;/h2&gt;

&lt;p&gt;A one-time export would be mildly useful.&lt;/p&gt;

&lt;p&gt;What makes &lt;code&gt;figmaclaw&lt;/code&gt; interesting is that it is incremental.&lt;/p&gt;

&lt;p&gt;At a high level, it works like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracked Figma files are checked for changes&lt;/li&gt;
&lt;li&gt;unchanged files or pages are skipped&lt;/li&gt;
&lt;li&gt;changed pages are updated&lt;/li&gt;
&lt;li&gt;only the changed frames get re-enriched&lt;/li&gt;
&lt;/ul&gt;
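&lt;p&gt;The loop above can be sketched as content-hash change detection. The function names and hashing scheme here are hypothetical, not &lt;code&gt;figmaclaw&lt;/code&gt;’s actual internals:&lt;/p&gt;

```python
# Hypothetical sketch of the incremental loop: hash each page's raw
# structure, skip pages whose hash is unchanged, re-process only what moved.
import hashlib
import json

def content_hash(page_data):
    """Stable hash of a page's raw structure, used to detect changes."""
    blob = json.dumps(page_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def sync_pages(pages, previous_hashes):
    """Return which pages changed; unchanged pages are skipped entirely."""
    changed, new_hashes = [], {}
    for page_id, page_data in pages.items():
        h = content_hash(page_data)
        new_hashes[page_id] = h
        if previous_hashes.get(page_id) != h:
            changed.append(page_id)  # only these get re-rendered and re-enriched
    return changed, new_hashes

pages = {"p1": {"name": "Login"}, "p2": {"name": "Feed"}}
changed, hashes = sync_pages(pages, {})      # first run: everything is new
pages["p2"]["name"] = "Home Feed"            # someone edits one page
changed2, _ = sync_pages(pages, hashes)
print(changed2)  # prints ['p2']
```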

&lt;p&gt;That means the system does not need to rebuild the entire design workspace every time someone makes a small edit.&lt;/p&gt;

&lt;p&gt;This matters for both speed and cost.&lt;/p&gt;

&lt;p&gt;It also means the mirror can stay reasonably close to the actual design source of truth without turning every sync into a huge batch process.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why sync and enrichment are separate
&lt;/h2&gt;

&lt;p&gt;This is one of the smartest parts of the architecture.&lt;/p&gt;

&lt;p&gt;There are really two different problems here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keeping the mirror structurally up to date&lt;/li&gt;
&lt;li&gt;making the mirror richly readable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first part is deterministic and cheap.&lt;br&gt;
The second part is interpretive and more expensive.&lt;/p&gt;

&lt;p&gt;So the system treats them separately.&lt;/p&gt;

&lt;p&gt;The sync layer updates the structure.&lt;br&gt;
The enrichment layer turns that structure into useful prose.&lt;/p&gt;

&lt;p&gt;That is a better architecture than trying to do everything in one pass.&lt;/p&gt;

&lt;p&gt;It means the mirror can stay fresh even while enrichment is still running, and it means you can prioritize the highest-value parts of the design corpus first.&lt;/p&gt;

&lt;p&gt;In our case, enrichment is currently in progress and the whole mirrored system should be enriched within a few hours. That matters because it removes the biggest remaining weakness we saw: pages that were structurally present but not yet richly described.&lt;/p&gt;


&lt;h2&gt;
  
  
  What this means for AI workflows
&lt;/h2&gt;

&lt;p&gt;This pattern is especially powerful if you are serious about AI-assisted development.&lt;/p&gt;

&lt;p&gt;LLMs do much better when the input is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;textual&lt;/li&gt;
&lt;li&gt;structured&lt;/li&gt;
&lt;li&gt;scoped&lt;/li&gt;
&lt;li&gt;and semantically meaningful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good markdown page description is a far better input than raw design JSON or a folder full of screenshots.&lt;/p&gt;

&lt;p&gt;That means this mirror is useful not just for human search, but for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;implementation assistance&lt;/li&gt;
&lt;li&gt;code review context&lt;/li&gt;
&lt;li&gt;architectural reasoning&lt;/li&gt;
&lt;li&gt;product/design retrieval&lt;/li&gt;
&lt;li&gt;and agent workflows that need to move across systems cleanly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the reasons the approach feels increasingly important. It is not just about convenience for humans. It is about building a better substrate for machine reasoning too.&lt;/p&gt;


&lt;h2&gt;
  
  
  How to use this privately on your own team
&lt;/h2&gt;

&lt;p&gt;You do not need to make any of this public.&lt;/p&gt;

&lt;p&gt;The default setup should be private.&lt;/p&gt;

&lt;p&gt;A simple model is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep Linear and Figma as your source systems&lt;/li&gt;
&lt;li&gt;mirror them into a private git repo with &lt;code&gt;aviadr1/issueclaw&lt;/code&gt; and &lt;code&gt;aviadr1/figmaclaw&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;run sync and enrichment in your own environment&lt;/li&gt;
&lt;li&gt;let your team and internal tools search the resulting markdown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal repo might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;product-memory/
  linear/
  figma/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The practical flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;create a private repo&lt;/li&gt;
&lt;li&gt;install &lt;code&gt;aviadr1/issueclaw&lt;/code&gt; and point it at your Linear workspace&lt;/li&gt;
&lt;li&gt;install &lt;code&gt;aviadr1/figmaclaw&lt;/code&gt; and point it at the Figma files you want mirrored&lt;/li&gt;
&lt;li&gt;run the initial syncs&lt;/li&gt;
&lt;li&gt;schedule periodic syncs via cron or CI&lt;/li&gt;
&lt;li&gt;run enrichment privately and commit the results back into the repo&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I have not verified the live README commands for both repositories line by line in this session, so I would use each repository’s current README and CLI help for the exact installation syntax. The important point is architectural: the whole pipeline can remain private end to end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who should care
&lt;/h2&gt;

&lt;p&gt;This is especially useful for teams that already have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serious product work in Linear&lt;/li&gt;
&lt;li&gt;serious design work in Figma&lt;/li&gt;
&lt;li&gt;fast-moving engineering&lt;/li&gt;
&lt;li&gt;and growing use of AI in development or internal tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, this is useful anywhere the real answer to a question usually lives across multiple systems rather than inside one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The real point of &lt;code&gt;aviadr1/issueclaw&lt;/code&gt; and &lt;code&gt;aviadr1/figmaclaw&lt;/code&gt; is not markdown.&lt;/p&gt;

&lt;p&gt;It is that they turn two important systems of record into a shared, searchable knowledge layer.&lt;/p&gt;

&lt;p&gt;That layer is useful because it helps humans and AI reason across product, design, and engineering as one system rather than three disconnected tools.&lt;/p&gt;

&lt;p&gt;At Gigaverse, that already made search, feature tracing, platform comparison, and design audits much easier.&lt;/p&gt;

&lt;p&gt;The deeper win was drift detection.&lt;/p&gt;

&lt;p&gt;That is what made the whole approach feel important.&lt;/p&gt;

&lt;p&gt;If you are building software with Linear, Figma, git, and AI-assisted workflows, this pattern is worth serious consideration.&lt;/p&gt;

&lt;p&gt;It gives your team something most tool stacks are missing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a working memory that spans the actual shape of product development.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>figma</category>
      <category>linear</category>
      <category>mcp</category>
      <category>development</category>
    </item>
    <item>
      <title>Claude Feels Slow. But Is Moving a Team to Open-Weight Models Actually the Fix?</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Mon, 30 Mar 2026 22:21:51 +0000</pubDate>
      <link>https://dev.to/aviadr1/claude-feels-slow-but-is-moving-a-team-to-open-weight-models-actually-the-fix-584f</link>
      <guid>https://dev.to/aviadr1/claude-feels-slow-but-is-moving-a-team-to-open-weight-models-actually-the-fix-584f</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Claude has a real speed problem for our team — but mostly in &lt;strong&gt;TTFT&lt;/strong&gt;, not in raw decoding speed.&lt;/p&gt;

&lt;p&gt;I measured our actual usage and found this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT p50:&lt;/strong&gt; 4.2s–6.8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT p90:&lt;/strong&gt; 14.5s–28.1s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet decode p50:&lt;/strong&gt; 176 tok/s &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That explains the feeling: Claude often isn’t that slow &lt;strong&gt;once it starts&lt;/strong&gt;, but sometimes it takes so long to begin that the whole thing feels like it’s crawling.&lt;/p&gt;

&lt;p&gt;That naturally raises the next question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should we move the team to self-hosted open-weight models?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first glance, that sounds promising. Self-hosted setups can have dramatically better TTFT. In the numbers I looked at, open-weight deployments were often estimated around &lt;strong&gt;150–600ms TTFT&lt;/strong&gt;, versus Claude’s &lt;strong&gt;4–7s median&lt;/strong&gt; in our real usage. &lt;/p&gt;

&lt;p&gt;But once I looked at the actual team setup — &lt;strong&gt;10 engineers sharing one GPU budget&lt;/strong&gt; — the answer stopped looking obvious.&lt;/p&gt;

&lt;p&gt;The best open-weight models need &lt;strong&gt;serious multi-GPU infra&lt;/strong&gt;, and once that infra is shared, the speed case starts looking surprisingly shaky. &lt;/p&gt;

&lt;p&gt;So this post is not “open source bad.”&lt;/p&gt;

&lt;p&gt;It’s a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If Claude feels slow, is moving a team to open-weight models on shared infra actually the answer?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Right now, I’m not convinced.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: Claude feels like it crawls
&lt;/h2&gt;

&lt;p&gt;This started with a very practical complaint:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Claude is slow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That could mean a lot of things, so I measured it.&lt;/p&gt;

&lt;p&gt;From about &lt;strong&gt;50 session files&lt;/strong&gt; and roughly &lt;strong&gt;3,000 API calls&lt;/strong&gt;, the problem was clear: the main issue was &lt;strong&gt;TTFT&lt;/strong&gt;, especially in the tail. &lt;/p&gt;

&lt;h3&gt;
  
  
  TTFT from our real usage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;p10&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p90&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User message&lt;/td&gt;
&lt;td&gt;2.8s&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;td&gt;28.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool result&lt;/td&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;14.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That &lt;strong&gt;28.1s p90&lt;/strong&gt; is the whole story. &lt;/p&gt;

&lt;p&gt;Claude is not just “a bit laggy” there. It’s slow enough to break flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The surprising part: decode speed wasn’t the main problem
&lt;/h2&gt;

&lt;p&gt;Here’s the other half of the picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generation speed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;p10&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p90&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode tok/s (excluding TTFT)&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;178&lt;/td&gt;
&lt;td&gt;567&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall tok/s (including TTFT)&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
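&lt;p&gt;The gap between those two rows is just TTFT amortized over the response. A quick check with the medians above (the 200-token response length is an assumed illustrative value, not a measurement) lands close to the measured wall number:&lt;/p&gt;

```python
# Wall-clock throughput = tokens / (TTFT + tokens / decode_speed).
# TTFT and decode p50 come from the tables above; the 200-token
# response length is an illustrative assumption.
def wall_tok_per_s(ttft_s, decode_tok_s, output_tokens):
    total_s = ttft_s + output_tokens / decode_tok_s
    return output_tokens / total_s

# Sonnet-like medians: 4.2s TTFT, 178 tok/s decode, ~200-token reply.
print(round(wall_tok_per_s(4.2, 178, 200)))  # prints 38
```

&lt;p&gt;That ≈38 tok/s is right next to the measured 41 tok/s wall p50, which is what you would expect if TTFT, not decoding, is the bottleneck.&lt;/p&gt;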

&lt;p&gt;And per model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;TTFT p50&lt;/th&gt;
&lt;th&gt;Decode p50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;287 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;176 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;4.7s&lt;/td&gt;
&lt;td&gt;130 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So the core problem wasn’t really:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Claude can’t stream fast enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Claude often takes too long to get started. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction matters, because it makes self-hosting sound much more attractive than it might actually be.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why open weights sound like the obvious answer
&lt;/h2&gt;

&lt;p&gt;If TTFT is the problem, then self-hosting sounds like the clean fix.&lt;/p&gt;

&lt;p&gt;The pitch is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no provider-side queue&lt;/li&gt;
&lt;li&gt;no shared API congestion&lt;/li&gt;
&lt;li&gt;your own inference server&lt;/li&gt;
&lt;li&gt;much lower TTFT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the numbers I collected from the self-hosting side were definitely seductive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Best-case self-hosted framing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Claude now&lt;/th&gt;
&lt;th&gt;Best self-hosted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool-triggered TTFT p50&lt;/td&gt;
&lt;td&gt;4,200ms&lt;/td&gt;
&lt;td&gt;~160ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-triggered TTFT p50&lt;/td&gt;
&lt;td&gt;6,800ms&lt;/td&gt;
&lt;td&gt;~160ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bad-day p90&lt;/td&gt;
&lt;td&gt;14,500ms+&lt;/td&gt;
&lt;td&gt;&amp;lt;400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If TTFT were the only thing that mattered, I think this would already be enough to move seriously toward GPUs. &lt;/p&gt;

&lt;p&gt;But TTFT is not the whole developer experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  The models we’d actually consider
&lt;/h2&gt;

&lt;p&gt;We’re not talking about toy models here. We’re talking about the real open-weight candidates people would actually put on the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Models considered
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why consider it?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-Coder-Next&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast MoE coding model, 80B total / 3B active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stronger quality candidate, 230B total / 10B active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V3.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very large MoE option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense, simpler, slower but cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the inference engines are the standard ones you’d expect:&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference engines
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model family&lt;/th&gt;
&lt;th&gt;Realistic inference engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen / DeepSeek&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; or &lt;strong&gt;SGLang&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense smaller models&lt;/td&gt;
&lt;td&gt;usually &lt;strong&gt;vLLM&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That means this isn’t some hypothetical future stack. It’s the standard modern self-hosted inference path. &lt;/p&gt;




&lt;h2&gt;
  
  
  The part that makes this much less exciting: GPU budgets are shared
&lt;/h2&gt;

&lt;p&gt;This is the piece I think gets hand-waved away too often.&lt;/p&gt;

&lt;p&gt;Our current setup is:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineers&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude subscription per engineer&lt;/td&gt;
&lt;td&gt;$150/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Claude cost&lt;/td&gt;
&lt;td&gt;$1,500/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The budget I was willing to entertain for self-hosting was roughly &lt;strong&gt;3× that&lt;/strong&gt;, so about &lt;strong&gt;$4,500/month&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That sounds like a lot.&lt;/p&gt;

&lt;p&gt;But for top open-weight coding models, it buys you something like this:&lt;/p&gt;

&lt;h3&gt;
  
  
  What the budget can buy
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Cost/month&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5× H100 on Vast.ai&lt;/td&gt;
&lt;td&gt;$4,712&lt;/td&gt;
&lt;td&gt;Enough for MiniMax M2.5 / DeepSeek-class INT4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3× H100 on Lambda&lt;/td&gt;
&lt;td&gt;$4,521&lt;/td&gt;
&lt;td&gt;More reliable, lower GPU count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4× H200 on Vast.ai&lt;/td&gt;
&lt;td&gt;$4,153&lt;/td&gt;
&lt;td&gt;Better memory bandwidth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8× A100 on Vast.ai&lt;/td&gt;
&lt;td&gt;$2,580&lt;/td&gt;
&lt;td&gt;Cheapest high-count option&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
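&lt;p&gt;A quick sanity check on what those quoted prices imply, assuming roughly 730 hours per month and a naive even split across the 10 engineers (both assumptions, not measurements):&lt;/p&gt;

```python
# Back-of-the-envelope: what the shared-cluster budget implies per engineer.
# Monthly prices are the ones quoted above; 730 hours/month is an assumption.
HOURS_PER_MONTH = 730
TEAM_SIZE = 10

def per_gpu_hourly(monthly_cost, gpu_count):
    return monthly_cost / (gpu_count * HOURS_PER_MONTH)

configs = {
    "5x H100 (Vast.ai)": (4712, 5),
    "3x H100 (Lambda)": (4521, 3),
    "4x H200 (Vast.ai)": (4153, 4),
    "8x A100 (Vast.ai)": (2580, 8),
}
for name, (cost, n) in configs.items():
    rate = per_gpu_hourly(cost, n)
    print(f"{name}: ${rate:.2f}/GPU-hr, ${cost / TEAM_SIZE:.0f}/engineer/mo")
```

&lt;p&gt;Even the cheapest option works out to roughly $260 per engineer per month for one shared deployment, against a $150 seat that comes with the provider’s entire serving stack.&lt;/p&gt;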

&lt;p&gt;That’s not “10 engineers each get a fast private model.”&lt;/p&gt;

&lt;p&gt;That’s &lt;strong&gt;one shared cluster&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;And that changes the question completely.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real metric is not TTFT. It’s team step time.
&lt;/h2&gt;

&lt;p&gt;The right equation is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;lower TTFT = faster experience&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s more like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;team step time = queueing + TTFT + output_tokens / decode_speed&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the part that made me hesitate.&lt;/p&gt;

&lt;p&gt;Because once you share one cluster across 10 engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFT might improve&lt;/li&gt;
&lt;li&gt;but per-user decoding might not&lt;/li&gt;
&lt;li&gt;and queueing becomes part of the story&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a very different situation from “look how fast this benchmark is on one box.”&lt;/p&gt;
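&lt;p&gt;A minimal sketch of that equation, with purely illustrative numbers (none of these are measurements):&lt;/p&gt;

```python
# Team step time = queueing + TTFT + output_tokens / decode_speed.
# All numbers below are illustrative assumptions, not measurements.
def step_time(queue_s, ttft_s, output_tokens, decode_tok_s):
    return queue_s + ttft_s + output_tokens / decode_tok_s

# Hosted API: no intra-team queue, but a slow TTFT tail
hosted = step_time(queue_s=0, ttft_s=20, output_tokens=800, decode_tok_s=60)

# Shared cluster: fast TTFT, but queueing and a divided decode budget
shared = step_time(queue_s=8, ttft_s=2, output_tokens=800, decode_tok_s=100)

print(f"hosted: {hosted:.1f}s  shared: {shared:.1f}s")  # hosted: 33.3s  shared: 18.0s
```

&lt;p&gt;Even a dramatic TTFT improvement can be clawed back by queueing and a divided decode budget; the three terms have to be reasoned about together.&lt;/p&gt;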




&lt;h2&gt;
  
  
  Why I’m not yet sold
&lt;/h2&gt;

&lt;p&gt;The self-hosted numbers I gathered looked like this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-hosted decode estimates I considered
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;INT4 decode tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-Next&lt;/td&gt;
&lt;td&gt;2× H100&lt;/td&gt;
&lt;td&gt;~3,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;4× H100&lt;/td&gt;
&lt;td&gt;~2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;2× H100&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;5× H100&lt;/td&gt;
&lt;td&gt;~700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;2× H100&lt;/td&gt;
&lt;td&gt;~380&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;1× H100&lt;/td&gt;
&lt;td&gt;~190&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those numbers are exciting. They make open weights look like a no-brainer. &lt;/p&gt;

&lt;p&gt;But they also raise exactly the question I still don’t think I’ve answered cleanly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are these the numbers one engineer feels, or the numbers a shared cluster produces in aggregate?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because for a 10-person team, those are not the same thing.&lt;/p&gt;

&lt;p&gt;And once I started looking at the problem through the lens of &lt;strong&gt;shared infra&lt;/strong&gt;, the speed case stopped looking like an obvious slam dunk.&lt;/p&gt;
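&lt;p&gt;To see the difference, divide the aggregate numbers by concurrency. The even split below is a simplifying assumption; real batched inference does better than a naive division, but per-user throughput still drops:&lt;/p&gt;

```python
# Aggregate decode speed vs what one engineer actually feels.
# Even-split concurrency is a simplifying assumption.
aggregate_tok_s = {             # INT4 decode estimates from the table above
    "Qwen3-Coder-Next, 2x H100": 3400,
    "MiniMax M2.5, 4x H100": 2000,
    "DeepSeek V3.2, 5x H100": 700,
}
concurrent_users = 6            # assume 6 of 10 engineers active at once
for model, agg in aggregate_tok_s.items():
    per_user = agg / concurrent_users
    print(f"{model}: {agg} tok/s aggregate, ~{per_user:.0f} tok/s per user")
```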




&lt;h2&gt;
  
  
  So where does that leave me?
&lt;/h2&gt;

&lt;p&gt;I think I’ve convinced myself of a few things:&lt;/p&gt;

&lt;h3&gt;
  
  
  What seems true
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Statement&lt;/th&gt;
&lt;th&gt;My current view&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude has a real speed problem&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The problem is mostly TTFT&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosting probably improves TTFT a lot&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The best open-weight models are expensive to run well&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared infra weakens the speed story&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moving the whole team looks obviously promising&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That’s the interesting part.&lt;/p&gt;

&lt;p&gt;The story I expected was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Claude is slow, open weights are fast, buy GPUs, problem solved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The story I actually found was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Claude is slow mostly because of TTFT.&lt;br&gt;
Open weights probably help that.&lt;br&gt;
But once the infra is shared across a team, the speed case gets much less clean.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;I started with a very simple frustration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude felt slow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I measured it and found a very specific issue:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT, especially the p90 tail, was bad enough to make the whole experience feel like it was crawling.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;That led to the obvious next idea:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if we just move to open-weight models on our own GPUs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And right now, my answer is not “definitely no.”&lt;/p&gt;

&lt;p&gt;It’s this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Open-weight models look promising for TTFT.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They look much less promising as a shared-infra speed fix for a whole team.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the question I’m left with.&lt;/p&gt;

&lt;p&gt;Not whether open weights are good.&lt;br&gt;
Not whether they’re possible.&lt;br&gt;
But whether they really solve the problem we actually have.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Replaced My LangSmith MCP Server with a 200-Token CLI Skill</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Thu, 05 Mar 2026 01:03:14 +0000</pubDate>
      <link>https://dev.to/aviadr1/i-replaced-my-langsmith-mcp-server-with-a-200-token-cli-skill-4p94</link>
      <guid>https://dev.to/aviadr1/i-replaced-my-langsmith-mcp-server-with-a-200-token-cli-skill-4p94</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How langsmith-cli gives you 100% MCP parity, 96% less context usage, and features the MCP server doesn't have — all in a single pip install.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're using LangSmith with Claude Code (or any AI coding agent), you're probably running the official MCP server. It works. But every session, it injects &lt;strong&gt;5,000+ tokens&lt;/strong&gt; of tool schemas into your context window — whether you touch LangSmith or not.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/gigaverse-app/langsmith-cli" rel="noopener noreferrer"&gt;langsmith-cli&lt;/a&gt; to fix that. It's a standalone CLI &lt;em&gt;and&lt;/em&gt; a Claude Code plugin that replaces the MCP server with a &lt;strong&gt;&amp;lt;200 token skill definition&lt;/strong&gt;. That's a 96% reduction in context overhead.&lt;/p&gt;

&lt;p&gt;And it does more than the MCP server does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with MCP Servers
&lt;/h2&gt;

&lt;p&gt;MCP servers are always-on. The moment your agent session starts, every tool definition gets loaded into context. For LangSmith's MCP server, that's 66 parameters across multiple tools — around 5,000 tokens of JSON schema that sits in your context window doing nothing until you actually need to query traces.&lt;/p&gt;

&lt;p&gt;For agents that need to do many things — write code, run tests, debug, &lt;em&gt;and occasionally&lt;/em&gt; check LangSmith — this is wasteful. Context is your agent's working memory. Every token of schema is a token not available for reasoning.&lt;/p&gt;
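&lt;p&gt;The arithmetic is simple. Assuming a 200k-token context window (the exact number depends on model and settings):&lt;/p&gt;

```python
# Context-budget arithmetic: always-on schema vs on-demand skill.
context_window = 200_000   # tokens; typical agent context (assumption)
mcp_schema = 5_000         # tokens of always-loaded MCP tool schema
skill_file = 200           # tokens, loaded only when the skill activates

print(f"MCP overhead: {mcp_schema / context_window:.1%} of context, every session")
print(f"Skill overhead when active: {skill_file / context_window:.2%}")
print(f"Reduction: {1 - skill_file / mcp_schema:.0%}")
```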

&lt;h2&gt;
  
  
  The Fix: On-Demand Skills Instead of Always-On Schemas
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;langsmith-cli&lt;/code&gt; takes a different approach. Instead of an MCP server, it's a CLI tool with a tiny skill file that teaches your agent &lt;em&gt;how&lt;/em&gt; to use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
uv tool &lt;span class="nb"&gt;install &lt;/span&gt;langsmith-cli

&lt;span class="c"&gt;# Add as Claude Code plugin&lt;/span&gt;
claude plugin marketplace add gigaverse-app/langsmith-cli
claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;langsmith-cli@langsmith-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill file is ~200 tokens. It loads on-demand. Your agent learns to run shell commands like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the latest failed run with only the fields you need&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get-latest &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--fields&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;,name,error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No schema bloat. No always-on server. Just a CLI your agent calls when it needs observability data.&lt;/p&gt;

&lt;h2&gt;
  
  
  96% Token Reduction with &lt;code&gt;--fields&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is the feature that matters most for agents. A typical LangSmith run object is &lt;strong&gt;20KB&lt;/strong&gt; — easily 1,000+ tokens. With &lt;code&gt;--fields&lt;/code&gt;, you get only what you asked for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full run object: ~1000 tokens&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get abc-123

&lt;span class="c"&gt;# Just what you need: ~40 tokens&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get abc-123 &lt;span class="nt"&gt;--fields&lt;/span&gt; name,status,error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--fields&lt;/code&gt; works on every list and get command: runs, projects, datasets, examples, prompts. Your agent stays lean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built for Two Audiences
&lt;/h2&gt;

&lt;p&gt;Most developer tools pick one audience. &lt;code&gt;langsmith-cli&lt;/code&gt; serves both:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For humans&lt;/strong&gt; — rich terminal tables with color-coded statuses, smart column truncation, syntax highlighting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs list &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--status&lt;/span&gt; error &lt;span class="nt"&gt;--last&lt;/span&gt; 24h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Name         ┃ Status     ┃ Tokens ┃ Latency  ┃ Error       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ extractor    │ error      │ 2,340  │ 3.2s     │ Rate limit  │
│ classifier   │ error      │ 1,102  │ 12.4s    │ Timeout     │
└──────────────┴────────────┴────────┴──────────┴─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For agents&lt;/strong&gt; — add &lt;code&gt;--json&lt;/code&gt; as the first flag and everything switches: strict JSON to stdout, diagnostics to stderr, zero formatting noise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--status&lt;/span&gt; error &lt;span class="nt"&gt;--limit&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One flag. Two completely different UX modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features the MCP Server Doesn't Have
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;langsmith-cli&lt;/code&gt; has 100% parity with the official MCP server (all 66 parameters mapped). But it also has features the MCP server can't offer:&lt;/p&gt;

&lt;h3&gt;
  
  
  Live Monitoring with &lt;code&gt;runs watch&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A real-time streaming dashboard in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs watch &lt;span class="nt"&gt;--project&lt;/span&gt; my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  One-Command Debugging with &lt;code&gt;runs get-latest&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;No more &lt;code&gt;list | jq | get&lt;/code&gt; pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before: three commands piped together&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--project&lt;/span&gt; X &lt;span class="nt"&gt;--limit&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[0].id'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | xargs langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get

&lt;span class="c"&gt;# After: one command&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get-latest &lt;span class="nt"&gt;--project&lt;/span&gt; X &lt;span class="nt"&gt;--fields&lt;/span&gt; inputs,outputs,error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stratified Sampling with &lt;code&gt;runs sample&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Build statistically sound eval datasets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs sample &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stratify-by&lt;/span&gt; tag:length,tag:content_type &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimension-values&lt;/span&gt; &lt;span class="s2"&gt;"short|long,news|gaming"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--samples-per-combination&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; eval_samples.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aggregate Analytics with &lt;code&gt;runs analyze&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Group-by metrics without leaving the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs analyze &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-by&lt;/span&gt; tag:model &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; count,error_rate,p50_latency,avg_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Schema Discovery with &lt;code&gt;runs fields&lt;/code&gt; / &lt;code&gt;runs describe&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Don't know what fields your runs have? Discover them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs fields &lt;span class="nt"&gt;--include&lt;/span&gt; inputs,outputs
&lt;span class="c"&gt;# Returns field paths, types, presence rates, even language distribution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tag &amp;amp; Metadata Discovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs tags &lt;span class="nt"&gt;--project&lt;/span&gt; my-app
langsmith-cli runs metadata-keys &lt;span class="nt"&gt;--project&lt;/span&gt; my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bulk Export with Pattern Filenames
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs &lt;span class="nb"&gt;export&lt;/span&gt; ./traces &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--roots&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 1000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filename-pattern&lt;/span&gt; &lt;span class="s2"&gt;"{name}-{run_id}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Run to Eval Example in One Command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; examples from-run &amp;lt;run-id&amp;gt; &lt;span class="nt"&gt;--dataset&lt;/span&gt; my-eval-set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Smart Filtering That Translates to FQL
&lt;/h2&gt;

&lt;p&gt;Nobody wants to write raw Filter Query Language. The CLI translates human-friendly flags automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# These flags...&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--tag&lt;/span&gt; summarizer &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--last&lt;/span&gt; 24h &lt;span class="nt"&gt;--slow&lt;/span&gt;

&lt;span class="c"&gt;# ...become this FQL:&lt;/span&gt;
&lt;span class="c"&gt;# and(has(tags, "summarizer"), eq(error, true),&lt;/span&gt;
&lt;span class="c"&gt;#     gt(start_time, "2026-03-03T..."), gt(latency, "5s"))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time presets like &lt;code&gt;--recent&lt;/code&gt; (last hour), &lt;code&gt;--today&lt;/code&gt;, &lt;code&gt;--last 7d&lt;/code&gt;, and &lt;code&gt;--since 2026-01-01&lt;/code&gt; all work. Content search with &lt;code&gt;--grep&lt;/code&gt; supports regex and field-specific matching. Everything composes.&lt;/p&gt;
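&lt;p&gt;As an illustration of what that translation layer does, here is a rough Python sketch. This is not the CLI's actual implementation; it simply produces FQL in the shape shown above:&lt;/p&gt;

```python
# Rough sketch of flag-to-FQL translation (illustration only; not the
# CLI's real internals). Output follows the FQL shape shown above.
def flags_to_fql(tag=None, failed=False, since_iso=None, slow_latency=None):
    clauses = []
    if tag:
        clauses.append(f'has(tags, "{tag}")')
    if failed:
        clauses.append("eq(error, true)")
    if since_iso:
        clauses.append(f'gt(start_time, "{since_iso}")')
    if slow_latency:
        clauses.append(f'gt(latency, "{slow_latency}")')
    return f"and({', '.join(clauses)})" if len(clauses) != 1 else clauses[0]

print(flags_to_fql(tag="summarizer", failed=True,
                   since_iso="2026-03-03T00:00:00", slow_latency="5s"))
```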

&lt;h2&gt;
  
  
  What's New in v0.4.0
&lt;/h2&gt;

&lt;p&gt;The v0.4.0 release focused on type safety and code quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero pyright errors&lt;/strong&gt; — every function has proper type annotations. &lt;code&gt;client: langsmith.Client&lt;/code&gt;, not &lt;code&gt;client: Any&lt;/code&gt;. Return types are real SDK Pydantic models, not &lt;code&gt;object&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;datasets delete&lt;/code&gt;&lt;/strong&gt; command with confirmation prompts and JSON mode support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved error handling&lt;/strong&gt; across prompts and runs commands using specific SDK exception types (&lt;code&gt;LangSmithNotFoundError&lt;/code&gt;, &lt;code&gt;LangSmithConflictError&lt;/code&gt;) instead of broad &lt;code&gt;except Exception&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;702 unit tests&lt;/strong&gt; passing with real Pydantic model instances (no MagicMock for test data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
uv tool &lt;span class="nb"&gt;install &lt;/span&gt;langsmith-cli
&lt;span class="c"&gt;# or: pip install langsmith-cli&lt;/span&gt;

&lt;span class="c"&gt;# Authenticate&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LANGSMITH_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"lsv2_..."&lt;/span&gt;
&lt;span class="c"&gt;# or: langsmith-cli auth login&lt;/span&gt;

&lt;span class="c"&gt;# Start exploring&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--last&lt;/span&gt; 24h
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs get-latest &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--fields&lt;/span&gt; name,error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using Claude Code, add the plugin for the best agent experience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin marketplace add gigaverse-app/langsmith-cli
claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;langsmith-cli@langsmith-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The code is MIT licensed and on GitHub: &lt;a href="https://github.com/gigaverse-app/langsmith-cli" rel="noopener noreferrer"&gt;gigaverse-app/langsmith-cli&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building with LangSmith and tired of context-heavy MCP servers, give it a try. Happy to hear feedback in the issues.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>mcp</category>
      <category>claude</category>
      <category>ai</category>
    </item>
    <item>
      <title>Using AI to Audit Engineering Team Output: An Experiment with Claude</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Mon, 19 Jan 2026 22:47:46 +0000</pubDate>
      <link>https://dev.to/aviadr1/using-ai-to-audit-engineering-team-output-an-experiment-with-claude-51h7</link>
      <guid>https://dev.to/aviadr1/using-ai-to-audit-engineering-team-output-an-experiment-with-claude-51h7</guid>
      <description>&lt;p&gt;In engineering management, few challenges are as persistent and contentious as objectively measuring team performance. How do we assess velocity, output quality, and even seniority in a way that is fair, insightful, and resistant to gaming? Traditional metrics—such as story points, commit counts, or lines of code—often fall short. They can incentivize quantity over quality, reward activity rather than impact, and fail to capture the nuanced reality of production software development.&lt;/p&gt;

&lt;p&gt;Recently, I conducted an experiment to explore a novel approach: using a large language model (Claude) as an impartial "auditor" to evaluate a codebase over several months. The goal was to answer fundamental questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was actually delivered?&lt;/li&gt;
&lt;li&gt;How complex was the work in reality?&lt;/li&gt;
&lt;li&gt;How long did it take compared to expectations for a senior engineer?&lt;/li&gt;
&lt;li&gt;How stable was the resulting output?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results were thought-provoking—and somewhat unflattering—prompting deeper reflection on whether AI can (or should) play a role in performance assessment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea: An AI-Powered Engineering Audit
&lt;/h2&gt;

&lt;p&gt;The experiment centered on a detailed system prompt designed to guide Claude through a structured audit process. Rather than relying on superficial signals like commit volume, the prompt instructs the model to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify distinct deliverables by analyzing code changes and timelines.&lt;/li&gt;
&lt;li&gt;Evaluate true complexity by examining architecture, dependencies, novel logic, and integration challenges.&lt;/li&gt;
&lt;li&gt;Compare actual development time against benchmarks for senior engineers.&lt;/li&gt;
&lt;li&gt;Assess stability through regression patterns, hotfixes, and rework.&lt;/li&gt;
&lt;li&gt;Quantify overall efficiency and highlight red flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Importantly, the prompt emphasizes reading the actual code, not just commit messages, to avoid bias from optimistic descriptions.&lt;/p&gt;

&lt;p&gt;Here is the full prompt I used: &lt;a href="https://gist.github.com/aviadr1/57bb7a7d80d415571174e1fed7acacba" rel="noopener noreferrer"&gt;GitHub Gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The prompt is divided into six phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deliverables Inventory&lt;/strong&gt; – Catalog what was built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True Complexity Assessment&lt;/strong&gt; – Classify each deliverable (TRIVIAL to HIGHLY COMPLEX) with evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Efficiency Analysis&lt;/strong&gt; – Expected vs. actual timelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality &amp;amp; Regression Assessment&lt;/strong&gt; – Stability and post-ship issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breadth vs. Depth Analysis&lt;/strong&gt; – Distribution of new features, improvements, maintenance, and rework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest Assessment&lt;/strong&gt; – Overall efficiency, rework rate, and verdict.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running the Experiment
&lt;/h2&gt;

&lt;p&gt;I applied this prompt to a real production codebase spanning approximately four months of work by a small team. The audit produced a comprehensive report with tables, evidence excerpts, and quantitative summaries.&lt;/p&gt;

&lt;p&gt;Key high-level findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most deliverables were rated as MODERATE or lower complexity, even those that felt substantial during development.&lt;/li&gt;
&lt;li&gt;Several features showed timelines 2–10x longer than the prompt's senior-engineer benchmarks.&lt;/li&gt;
&lt;li&gt;Rework (fixes, hotfixes, reverts) consumed a significant portion of total effort.&lt;/li&gt;
&lt;li&gt;The overall efficiency calculation suggested room for substantial improvement compared to a "well-functioning" baseline.&lt;/li&gt;
&lt;/ul&gt;
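&lt;p&gt;For intuition, the efficiency and rework figures reduce to simple ratios. The numbers below are invented for illustration and do not come from the actual report:&lt;/p&gt;

```python
# Hypothetical sketch of the audit's bottom-line arithmetic. All numbers
# are invented for illustration; none come from the actual report.
deliverables = [
    # (expected days for a senior engineer, actual days taken)
    (5, 12),
    (10, 25),
    (3, 30),
]
rework_days = 20   # fixes, hotfixes, reverts

delivery_days = sum(actual for _, actual in deliverables)
total_effort = delivery_days + rework_days
efficiency = sum(expected for expected, _ in deliverables) / total_effort
rework_rate = rework_days / total_effort

print(f"efficiency vs senior baseline: {efficiency:.0%}")
print(f"rework share of total effort: {rework_rate:.0%}")
```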

&lt;p&gt;Notably, the assessment appeared conservative—often downgrading apparent complexity by emphasizing reusable patterns, existing infrastructure, and boilerplate. For instance, systems involving real-time processing and multiple external integrations were classified as COMPLEX but with expected senior timelines of 1–2 weeks, despite months of iteration in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reflections: Does AI Underestimate—or Cut Through Bias?
&lt;/h2&gt;

&lt;p&gt;The results raised immediate questions about the reliability of AI-driven audits.&lt;/p&gt;

&lt;p&gt;On one hand, LLMs may systematically underestimate complexity in real-world systems. Production engineering involves hidden challenges that models struggle to appreciate fully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iterative policy refinement and edge-case handling.&lt;/li&gt;
&lt;li&gt;Integration debt with legacy systems or third-party APIs.&lt;/li&gt;
&lt;li&gt;Coordination overhead in distributed teams.&lt;/li&gt;
&lt;li&gt;The intangible cost of context-switching and debugging in complex environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, the audit's critical tone might reflect a valuable strength: cutting through human bias. Engineers (myself included) naturally overestimate the novelty and difficulty of our work. An impartial observer—especially one trained on vast codebases—can identify patterns, leverage points, and simplifications that feel groundbreaking internally but are standard externally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broader Implications and Open Questions
&lt;/h2&gt;

&lt;p&gt;This experiment touches on several important topics in engineering leadership:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Measurement Practices&lt;/strong&gt;: How do most organizations currently evaluate velocity and quality? Many rely on proxies (velocity in story points, DORA metrics, cycle time), but few attempt deep code-level audits. Could structured AI analysis complement these?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seniority Calibration&lt;/strong&gt;: The prompt uses senior-engineer time estimates as anchors. Is this fair? Seniority varies widely—what one engineer completes in days, another might require weeks due to domain knowledge gaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Considerations&lt;/strong&gt;: Automating performance assessment raises concerns around fairness, transparency, and morale. If an AI labels output as "inefficient," how do we ensure the evaluation is accurate and contextualized? Should such tools ever influence compensation or promotion?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Limitations and Improvements&lt;/strong&gt;: Current models excel at pattern matching but may miss subtle production realities. Future iterations could incorporate more signals (e.g., test coverage depth, production incident data, user impact metrics).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have you experimented with LLMs for code review, retrospective analysis, or performance measurement? What pros and cons have you encountered?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using Claude to audit engineering output proved to be a revealing—if humbling—exercise. While the assessment seemed to underestimate certain complexities, it highlighted areas for process improvement and forced a more honest reckoning with deliverables.&lt;/p&gt;

&lt;p&gt;I share this not to prescribe AI audits as the solution, but to spark discussion. In an era where AI is transforming how we write code, perhaps it can also help us better understand the code we write—and the teams that write it.&lt;/p&gt;

&lt;p&gt;What are your thoughts? How do you measure engineering effectiveness today, and where do you see AI fitting in (or not)?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks for reading. Feedback welcome in the comments.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>ai</category>
      <category>productivity</category>
      <category>codereview</category>
      <category>claude</category>
    </item>
    <item>
      <title>LangSmith CLI: Not Just Efficient — Actually Better Than MCP</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Thu, 15 Jan 2026 00:13:07 +0000</pubDate>
      <link>https://dev.to/aviadr1/langsmith-cli-not-just-efficient-actually-better-than-mcp-4o71</link>
      <guid>https://dev.to/aviadr1/langsmith-cli-not-just-efficient-actually-better-than-mcp-4o71</guid>
      <description>&lt;p&gt;LangSmith gives us incredible visibility into LLM applications: full traces, datasets, prompt versioning, evaluations — everything we need to build reliable AI systems.&lt;/p&gt;

&lt;p&gt;But actually &lt;strong&gt;using&lt;/strong&gt; LangSmith day-to-day has always felt clunky:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Constantly refreshing the web UI
&lt;/li&gt;
&lt;li&gt;Writing custom API scripts
&lt;/li&gt;
&lt;li&gt;Or using MCP servers that quietly eat &lt;strong&gt;16,100 tokens&lt;/strong&gt; of context — permanently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built &lt;strong&gt;&lt;a href="https://github.com/gigaverse-app/langsmith-cli" rel="noopener noreferrer"&gt;langsmith-cli&lt;/a&gt;&lt;/strong&gt; to solve this properly.&lt;/p&gt;

&lt;p&gt;It's not only dramatically more efficient (177× less context overhead).&lt;br&gt;&lt;br&gt;
It is &lt;strong&gt;fundamentally better&lt;/strong&gt; for real debugging, analysis, and production monitoring workflows.&lt;/p&gt;

&lt;p&gt;Here’s why — with real measurements and concrete examples.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Context Is Precious — 177× Less Waste
&lt;/h3&gt;

&lt;p&gt;Straight from &lt;code&gt;/context&lt;/code&gt; in Claude Code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP LangSmith tools&lt;/strong&gt; → &lt;strong&gt;16,100 tokens&lt;/strong&gt; always loaded (~8% of 200k context)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;langsmith-cli as Skill&lt;/strong&gt; → &lt;strong&gt;91 tokens&lt;/strong&gt; only when activated, &lt;strong&gt;0&lt;/strong&gt; when idle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;177× difference&lt;/strong&gt; in context overhead.&lt;/p&gt;

&lt;p&gt;This is not theoretical.&lt;br&gt;&lt;br&gt;
Every extra 10–20k tokens of tool definitions means less room for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conversation history
&lt;/li&gt;
&lt;li&gt;source code
&lt;/li&gt;
&lt;li&gt;documentation
&lt;/li&gt;
&lt;li&gt;actual reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add 2–3 more MCP servers → 20–30% of your context disappears before you start working.&lt;/p&gt;
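&lt;p&gt;The 177× figure is just the ratio of the two measurements:&lt;/p&gt;

```python
# The 177x claim, from the two /context measurements quoted above.
mcp_tokens = 16_100    # always loaded by the MCP server
skill_tokens = 91      # loaded only when the skill activates

print(f"overhead ratio: {mcp_tokens / skill_tokens:.0f}x")      # 177x
print(f"share of a 200k context: {mcp_tokens / 200_000:.0%}")   # 8%
```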
&lt;h3&gt;
  
  
  2. Real-time Production Monitoring — &lt;code&gt;runs watch&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The single feature that made me never want to go back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs watch &lt;span class="nt"&gt;--project&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get an auto-refreshing, color-coded terminal dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live status (🟢 / 🔴)
&lt;/li&gt;
&lt;li&gt;Latency, token usage, relative time
&lt;/li&gt;
&lt;li&gt;Instant visibility into error rate and average performance
&lt;/li&gt;
&lt;li&gt;Filter on the fly: &lt;code&gt;--failed&lt;/code&gt;, &lt;code&gt;--slow&lt;/code&gt;, &lt;code&gt;--model gpt-4&lt;/code&gt;, &lt;code&gt;--tag customer-facing&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No browser refresh. No delay.&lt;br&gt;&lt;br&gt;
You literally see production break (or recover) in real time.&lt;/p&gt;

&lt;p&gt;MCP + web UI simply cannot match this immediacy.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Powerful, Developer-first Filtering
&lt;/h3&gt;

&lt;p&gt;Finding the right runs should not require writing custom code every time.&lt;/p&gt;

&lt;p&gt;Examples that are awkward at best through MCP or the web UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Regex on run names&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--name-regex&lt;/span&gt; &lt;span class="s2"&gt;"^api-v[0-9]+&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;[0-9]+"&lt;/span&gt;

&lt;span class="c"&gt;# Wildcard + smart presets&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--name-pattern&lt;/span&gt; &lt;span class="s2"&gt;"*auth*"&lt;/span&gt; &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--today&lt;/span&gt;

&lt;span class="c"&gt;# Time ranges (very natural syntax)&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"1 hour ago"&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--last&lt;/span&gt; 24h
langsmith-cli runs list &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2025-12-01"&lt;/span&gt; &lt;span class="nt"&gt;--until&lt;/span&gt; &lt;span class="s2"&gt;"2025-12-02"&lt;/span&gt;

&lt;span class="c"&gt;# Expensive / slow runs&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--min-tokens&lt;/span&gt; 8000 &lt;span class="nt"&gt;--slow&lt;/span&gt; &lt;span class="nt"&gt;--today&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These filters are fast, composable, and — most importantly — &lt;strong&gt;stay in your terminal flow&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Field Pruning: 95% Token Reduction on Responses
&lt;/h3&gt;

&lt;p&gt;A complex multi-agent trace can easily be &lt;strong&gt;~4,200 tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Fetching 10 failed runs in full → &lt;strong&gt;~42k tokens&lt;/strong&gt; just for data.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--fields&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 10 &lt;span class="nt"&gt;--fields&lt;/span&gt; name,error,latency,status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ ~214 tokens per run instead of 4,210&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;~95% reduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You only pay for the information you actually need.&lt;/p&gt;

&lt;p&gt;MCP always returns the complete object. Every time.&lt;/p&gt;
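To see why the pruned shape matters downstream, here is a sketch of post-processing on such output. The JSON below is illustrative, hand-written to match the `--fields name,error,latency,status` selection above, not captured from the tool:

```shell
# Illustrative pruned records: a handful of fields per run, not a ~4k-token trace.
runs='[{"name":"api-v1.2","error":"timeout","latency":3.1,"status":"error"},
       {"name":"api-v1.3","error":"rate limit","latency":0.4,"status":"error"}]'

# Tiny records stay cheap to scan, filter, and summarize.
echo "$runs" | jq -r '.[] | select(.error | contains("timeout")) | .name'
# → api-v1.2
```

At a couple of hundred tokens per record, even hundreds of runs fit comfortably alongside the rest of your context.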
&lt;h3&gt;
  
  
  5. Dual Excellent UX — Humans + Agents
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Human mode (beautiful rich table)&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--project&lt;/span&gt; production &lt;span class="nt"&gt;--limit&lt;/span&gt; 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;→ Color-coded, aggregates, relative times, clean formatting&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Agent / script mode (strict, minimal JSON)&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--fields&lt;/span&gt; name,error,latency &lt;span class="nt"&gt;--limit&lt;/span&gt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One tool. Two perfect interfaces.&lt;br&gt;&lt;br&gt;
No compromises.&lt;/p&gt;
&lt;h3&gt;
  
  
  6. Export Formats That Actually Help Teams
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--format csv&lt;/code&gt; → Excel, pivot tables, stakeholder reports
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--format yaml&lt;/code&gt; → configs, reproducible environments
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--json&lt;/code&gt; → agents, automation, monitoring pipelines
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--today&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; failed-runs-today.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Open → analyze → share. Done.&lt;/p&gt;
&lt;h3&gt;
  
  
  7. Unix Philosophy — Full Composability
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# How many timeout errors today?&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--today&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.[] | select(.error | contains("timeout"))'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;

&lt;span class="c"&gt;# Top 5 most common errors&lt;/span&gt;
langsmith-cli &lt;span class="nt"&gt;--json&lt;/span&gt; runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 200 &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[] | .error'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is where CLI completely outclasses MCP + web.&lt;/p&gt;

&lt;p&gt;You already know these tools.&lt;br&gt;&lt;br&gt;
You already have the scripts.&lt;br&gt;&lt;br&gt;
Now they work with LangSmith too.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quick Start (Really 30–60 Seconds)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (isolated, safe, works everywhere)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://raw.githubusercontent.com/gigaverse-app/langsmith-cli/main/scripts/install.sh | sh

&lt;span class="c"&gt;# Or faster with uv:&lt;/span&gt;
uv tool &lt;span class="nb"&gt;install &lt;/span&gt;langsmith-cli

&lt;span class="c"&gt;# Add as skill in Claude Code&lt;/span&gt;
/plugin marketplace add gigaverse-app/langsmith-cli

&lt;span class="c"&gt;# First login&lt;/span&gt;
langsmith-cli auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;langsmith-cli runs watch &lt;span class="nt"&gt;--project&lt;/span&gt; production
&lt;span class="c"&gt;# or&lt;/span&gt;
langsmith-cli runs list &lt;span class="nt"&gt;--failed&lt;/span&gt; &lt;span class="nt"&gt;--today&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Verdict
&lt;/h3&gt;

&lt;p&gt;langsmith-cli is &lt;strong&gt;not&lt;/strong&gt; just "lighter" than MCP.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;objectively better&lt;/strong&gt; at the things that matter most when debugging and operating LLM systems in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time visibility
&lt;/li&gt;
&lt;li&gt;Powerful filtering without code
&lt;/li&gt;
&lt;li&gt;Massive context &amp;amp; token savings
&lt;/li&gt;
&lt;li&gt;Beautiful human UX + perfect machine UX
&lt;/li&gt;
&lt;li&gt;Export formats teams actually use
&lt;/li&gt;
&lt;li&gt;Full Unix-style composability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;177× less context overhead is nice.&lt;br&gt;&lt;br&gt;
But being able to &lt;strong&gt;watch production live&lt;/strong&gt;, &lt;strong&gt;find problems in seconds&lt;/strong&gt;, and &lt;strong&gt;export meaningful data instantly&lt;/strong&gt; — that's why I built it, and why I never want to go back.&lt;/p&gt;

&lt;p&gt;Give it 60 seconds.&lt;br&gt;&lt;br&gt;
Run &lt;code&gt;/context&lt;/code&gt; before and after.&lt;br&gt;&lt;br&gt;
The numbers don't lie.&lt;/p&gt;

&lt;p&gt;Repo → &lt;a href="https://github.com/gigaverse-app/langsmith-cli" rel="noopener noreferrer"&gt;https://github.com/gigaverse-app/langsmith-cli&lt;/a&gt; (MIT)&lt;/p&gt;

&lt;p&gt;Happy (much faster) debugging!&lt;br&gt;&lt;br&gt;
Aviad&lt;/p&gt;

&lt;p&gt;#LangSmith #LLM #Observability #AIDevTools #ClaudeCode&lt;/p&gt;

</description>
      <category>claude</category>
      <category>mcp</category>
      <category>langsmith</category>
      <category>ai</category>
    </item>
    <item>
<title>LangSmith CLI: Why Lightweight Skills Crush Heavy MCP Servers (Context Is All You Need)</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Wed, 14 Jan 2026 23:47:47 +0000</pubDate>
      <link>https://dev.to/aviadr1/langsmith-cli-why-lightweight-skills-crush-heavy-mcp-servers-context-is-all-you-need-1aeb</link>
      <guid>https://dev.to/aviadr1/langsmith-cli-why-lightweight-skills-crush-heavy-mcp-servers-context-is-all-you-need-1aeb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Measured reality in Claude Code sessions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP LangSmith tools&lt;/strong&gt; → &lt;strong&gt;16,100 tokens&lt;/strong&gt; always loaded (≈8% of 200k context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;langsmith-cli as Skill&lt;/strong&gt; → &lt;strong&gt;91 tokens&lt;/strong&gt; when activated, &lt;strong&gt;0 tokens&lt;/strong&gt; when idle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difference&lt;/strong&gt;: &lt;strong&gt;177×&lt;/strong&gt; less context overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Installation: 30 seconds vs typical 15+ minutes&lt;br&gt;&lt;br&gt;
Field pruning: up to &lt;strong&gt;95%&lt;/strong&gt; token reduction on responses&lt;br&gt;&lt;br&gt;
Startup: 43–87 ms cold/warm&lt;/p&gt;

&lt;p&gt;Skills win for the majority of stateless AI tooling operations.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Context Tax – Measured Reality
&lt;/h2&gt;

&lt;p&gt;Right now, in my Claude Code session, the LangSmith MCP tools are consuming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP tools · /mcp
├ mcp__langsmith__run_experiment     3.2k tokens
├ mcp__langsmith__push_prompt         2.8k tokens
├ mcp__langsmith__fetch_runs          2.2k tokens
...
└ mcp__langsmith__get_prompt_by_name    146 tokens
TOTAL: 16,100 tokens   (≈8% of 200k context window)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These definitions are &lt;strong&gt;permanently loaded&lt;/strong&gt; — even if I never touch LangSmith during the entire conversation.&lt;/p&gt;

&lt;p&gt;The same functionality implemented as a &lt;strong&gt;Skill&lt;/strong&gt; (subprocess-based CLI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Skills · /skills
├ commit-commands:clean_gone          46 tokens
├ agent-sdk-dev:new-sdk-app           19 tokens
...
TOTAL when activated: 91 tokens   (0.045% of context)
Inactive: 0 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;177× difference.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Not an estimate — actual numbers from the &lt;code&gt;/context&lt;/code&gt; command.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Does This Matter? Context Economics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price (Claude Opus 4.5)&lt;/th&gt;
&lt;th&gt;Impact of losing 16k tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;$15 / million&lt;/td&gt;
&lt;td&gt;~$0.24 per query in pure overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200k context window&lt;/td&gt;
&lt;td&gt;shared resource&lt;/td&gt;
&lt;td&gt;8% permanently occupied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 typical MCP servers&lt;/td&gt;
&lt;td&gt;~36–48k tokens&lt;/td&gt;
&lt;td&gt;18–24% of context gone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Freed context (35k+ tokens)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;≈30 pages of docs / 500+ lines of code / long conversation history&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The more MCP servers you add, the faster your effective context window shrinks — &lt;strong&gt;before any real work begins&lt;/strong&gt;.&lt;/p&gt;
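The dollar figure in the table is simple arithmetic. A quick sanity check, assuming the $15-per-million input price quoted above (awk is used here purely as a calculator):

```shell
# 16,100 tokens of tool definitions re-sent with every request:
awk 'BEGIN {
  tokens = 16100
  price  = 15 / 1000000          # dollars per input token
  printf "per query:   $%.4f\n", tokens * price
  printf "100 queries: $%.2f\n",  tokens * 100 * price
}'
# → per query:   $0.2415
# → 100 queries: $24.15
```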


&lt;h2&gt;
  
  
  Architectural Comparison: Persistent vs On-demand
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;MCP Servers&lt;/th&gt;
&lt;th&gt;Skills (subprocess CLI)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Loading moment&lt;/td&gt;
&lt;td&gt;At application start&lt;/td&gt;
&lt;td&gt;Only when explicitly activated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context occupation&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;td&gt;Temporary + very small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time (measured)&lt;/td&gt;
&lt;td&gt;Usually 1–3+ seconds&lt;/td&gt;
&lt;td&gt;43–87 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource consumption&lt;/td&gt;
&lt;td&gt;Persistent process&lt;/td&gt;
&lt;td&gt;Starts → works → exits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifecycle management&lt;/td&gt;
&lt;td&gt;Required (start/stop/restart/debug)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Installation complexity&lt;/td&gt;
&lt;td&gt;Medium–high (config, env vars, debugging)&lt;/td&gt;
&lt;td&gt;Very low (curl / uv tool)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composability&lt;/td&gt;
&lt;td&gt;Limited (JSON only)&lt;/td&gt;
&lt;td&gt;Excellent (Unix pipes friendly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output control&lt;/td&gt;
&lt;td&gt;Full objects always&lt;/td&gt;
&lt;td&gt;Field pruning + multiple formats&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Most AI tooling operations are stateless queries&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ list, get, create, update, export&lt;br&gt;&lt;br&gt;
→ They don't need persistent connections, pools, watchers, or bidirectional streaming.&lt;/p&gt;


&lt;h2&gt;
  
  
  Added Value of langsmith-cli (Beyond Context Efficiency)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggressive field pruning&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Full &lt;code&gt;Run&lt;/code&gt; object ≈ 4.2k tokens&lt;br&gt;&lt;br&gt;
Pruned (name, error, latency, etc.) ≈ 200–300 tokens&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;~90–95%&lt;/strong&gt; reduction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiple output formats&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;--json&lt;/code&gt;, &lt;code&gt;--format csv&lt;/code&gt;, &lt;code&gt;--format yaml&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human-friendly + agent-friendly dual UX&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Rich tables when interactive, clean JSON when piped&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced filtering presets&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;--failed&lt;/code&gt;, &lt;code&gt;--slow&lt;/code&gt;, &lt;code&gt;--today&lt;/code&gt;, regex/wildcard on names, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Live watching TUI&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;langsmith-cli runs watch --project production&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Real Numbers from Real Session (Debug Example)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Find failed runs from last hour + show error messages&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills version&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Context cost: 91 tokens (skill definition)&lt;br&gt;&lt;br&gt;
Response: ≈500 tokens (pruned fields, 5 runs)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Total ≈ 591 tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP version&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Context cost: 16,100 tokens (always)&lt;br&gt;&lt;br&gt;
Response: ≈2,000 tokens (full objects)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Total ≈ 18,100 tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;30.6× more context&lt;/strong&gt; for the same information&lt;/p&gt;
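That ratio follows directly from the measured token counts above; a minimal recomputation (again, awk as a calculator):

```shell
awk 'BEGIN {
  skills = 91 + 500        # skill definition + pruned response
  mcp    = 16100 + 2000    # always-loaded tool defs + full-object response
  printf "skills: %d  mcp: %d  ratio: %.1fx\n", skills, mcp, mcp / skills
}'
# → skills: 591  mcp: 18100  ratio: 30.6x
```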


&lt;h2&gt;
  
  
  Installation – 30 Seconds vs 15+ Minutes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recommended (Skills):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One-liner (creates isolated venv, adds to PATH)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://raw.githubusercontent.com/gigaverse-app/langsmith-cli/main/scripts/install.sh | sh

&lt;span class="c"&gt;# Then in Claude Code&lt;/span&gt;
/plugin marketplace add gigaverse-app/langsmith-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical MCP path:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install langsmith-mcp-server&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;manual editing of &lt;code&gt;config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;setting env variables&lt;/li&gt;
&lt;li&gt;debugging python path / permissions / port conflicts&lt;/li&gt;
&lt;li&gt;restart client&lt;/li&gt;
&lt;li&gt;check logs...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ frequently 15–40 minutes in total&lt;/p&gt;




&lt;h2&gt;
  
  
  When MCP Still Makes Sense (Fair Comparison)
&lt;/h2&gt;

&lt;p&gt;Use &lt;strong&gt;MCP servers&lt;/strong&gt; when you really need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;persistent expensive state (connection pools, large in-memory caches)&lt;/li&gt;
&lt;li&gt;background processing (file watchers, long-polling)&lt;/li&gt;
&lt;li&gt;bidirectional streaming&lt;/li&gt;
&lt;li&gt;very heavy initialization (5GB+ ML models)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For 90–95%&lt;/strong&gt; of current LangSmith / tracing / evaluation use-cases → &lt;strong&gt;skills are superior&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start – Measure It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CLI&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://raw.githubusercontent.com/gigaverse-app/langsmith-cli/main/scripts/install.sh | sh

&lt;span class="c"&gt;# Add as skill in Claude Code&lt;/span&gt;
/plugin marketplace add gigaverse-app/langsmith-cli

&lt;span class="c"&gt;# See the dramatic difference&lt;/span&gt;
/context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo: &lt;a href="https://github.com/gigaverse-app/langsmith-cli" rel="noopener noreferrer"&gt;https://github.com/gigaverse-app/langsmith-cli&lt;/a&gt;&lt;br&gt;&lt;br&gt;
(MIT license – contributions welcome)&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Context is the most precious resource in long-context LLMs.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Don't waste it on infrastructure that can be replaced with an 80-millisecond subprocess call.&lt;/p&gt;

&lt;p&gt;Try the skills approach.&lt;br&gt;&lt;br&gt;
The numbers don't lie.&lt;/p&gt;

&lt;p&gt;Happy (much lighter) hacking!&lt;br&gt;&lt;br&gt;
Aviad&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>langsmith</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Enhancing Claude Code: Turning a Brilliant Showcase into a Modular Plugin Marketplace</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Sun, 11 Jan 2026 17:31:16 +0000</pubDate>
      <link>https://dev.to/aviadr1/enhancing-claude-code-turning-a-brilliant-showcase-into-a-modular-plugin-marketplace-25c0</link>
      <guid>https://dev.to/aviadr1/enhancing-claude-code-turning-a-brilliant-showcase-into-a-modular-plugin-marketplace-25c0</guid>
      <description>&lt;p&gt;I'm really excited about &lt;strong&gt;Claude Code&lt;/strong&gt; from Anthropic—it's an impressive way to bring AI assistance deeply into your development workflow. Recently, thanks to a recommendation from &lt;a href="https://x.com/0x0SojalSec" rel="noopener noreferrer"&gt;@0x0SojalSec&lt;/a&gt;, I discovered &lt;a href="https://github.com/ChrisWiles" rel="noopener noreferrer"&gt;@ChrisWiles&lt;/a&gt;' excellent repository: &lt;a href="https://github.com/ChrisWiles/claude-code-showcase" rel="noopener noreferrer"&gt;claude-code-showcase&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With nearly 4,000 stars, the original repo is a wonderful resource. It clearly demonstrates how to configure Claude Code with hooks for automation, skills for project conventions, agents for tasks like code reviews, slash commands for workflows, and GitHub Actions for scheduled maintenance. It's a fantastic guide for anyone looking to make Claude a true team member.&lt;/p&gt;

&lt;p&gt;That said, while it's perfect for learning, adopting it often means copying large portions into your own projects. To make it even more accessible, I forked it and added a modular structure—a &lt;strong&gt;plugin marketplace&lt;/strong&gt; that lets you install only the components you need.&lt;/p&gt;

&lt;p&gt;My fork: &lt;a href="https://github.com/aviadr1/claude-code-showcase" rel="noopener noreferrer"&gt;aviadr1/claude-code-showcase&lt;/a&gt;&lt;br&gt;&lt;br&gt;
(It's brand new as of January 11, 2026—feel free to star it if you find it useful! 🚀)&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Claude Code Is So Exciting
&lt;/h2&gt;

&lt;p&gt;For those new to it: Claude Code lets you configure an AI agent with persistent project memory (via CLAUDE.md), specialized skills, automated hooks, dedicated agents, and integrations with external tools. It goes beyond simple autocompletion to help enforce standards, review code, update dependencies, sync documentation, and manage tickets.&lt;/p&gt;

&lt;p&gt;The original showcase illustrates these capabilities beautifully. My goal was to reduce the barriers to real-world use.&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Improvements in the Fork
&lt;/h2&gt;

&lt;p&gt;I preserved all the great features from the original and added several enhancements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plugin Marketplace Architecture&lt;/strong&gt;
The repository now functions as a marketplace for installable components. Symlinks ensure a single source of truth—updates in one place automatically apply everywhere, with no need for build steps or sync scripts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Installation is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   /plugin marketplace add aviadr1/claude-code-showcase

   &lt;span class="c"&gt;# Then install what you need&lt;/span&gt;
   /plugin &lt;span class="nb"&gt;install &lt;/span&gt;testing-patterns@claude-code-showcase
   /plugin &lt;span class="nb"&gt;install &lt;/span&gt;code-review-suite@claude-code-showcase
   /plugin &lt;span class="nb"&gt;install &lt;/span&gt;github-actions@claude-code-showcase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;13 Modular Plugins&lt;/strong&gt;
Choose from individual skills, bundled sets, commands, and hooks.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plugin&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;testing-patterns&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Jest, TDD, factory functions, and mocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Structured 4-phase debugging approach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;react-ui-patterns&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Handling loading, error, and empty states&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;formik-patterns&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Form validation best practices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;graphql-schema&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Queries, mutations, and codegen patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;core-components&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Design system and component library usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pr-toolkit&lt;/td&gt;
&lt;td&gt;Bundle&lt;/td&gt;
&lt;td&gt;PR review, summary, and GitHub workflow agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;code-review-suite&lt;/td&gt;
&lt;td&gt;Bundle&lt;/td&gt;
&lt;td&gt;Dedicated reviewer agent + code quality command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ticket-workflow&lt;/td&gt;
&lt;td&gt;Bundle&lt;/td&gt;
&lt;td&gt;JIRA/Linear integration with ticket commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;docs-sync&lt;/td&gt;
&lt;td&gt;Command&lt;/td&gt;
&lt;td&gt;Automated documentation consistency checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skill-activation&lt;/td&gt;
&lt;td&gt;Hook&lt;/td&gt;
&lt;td&gt;Suggests relevant skills based on your prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;plugin-marketplace&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Guides you on building your own marketplace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;github-actions&lt;/td&gt;
&lt;td&gt;Bundle&lt;/td&gt;
&lt;td&gt;Reusable CI/CD workflows (detailed below)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Reusable GitHub Actions Workflows&lt;/strong&gt;
A major addition: workflows that use GitHub's &lt;code&gt;workflow_call&lt;/code&gt; feature, so any repository can reference them directly with just a few lines of YAML.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example caller workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# .github/workflows/claude-pr-review.yml&lt;/span&gt;
   &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Claude PR Review&lt;/span&gt;
   &lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reopened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
     &lt;span class="na"&gt;issue_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
   &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aviadr1/claude-code-showcase/.github/workflows/showcase-pr-review.yml@main&lt;/span&gt;
       &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Available workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;showcase-pr-review.yml&lt;/code&gt; → Automated PR reviews and &lt;a class="mentioned-user" href="https://dev.to/claude"&gt;@claude&lt;/a&gt; responses&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;showcase-docs-sync.yml&lt;/code&gt; → Monthly documentation vs. code checks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;showcase-nodejs-code-quality.yml&lt;/code&gt; → Weekly quality sweeps with auto-fix PRs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;showcase-nodejs-dependency-audit.yml&lt;/code&gt; → Bi-weekly dependency updates with testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are fully configurable and integrate with &lt;code&gt;anthropics/claude-code-action@beta&lt;/code&gt;.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Additional Polish&lt;/strong&gt;
Intelligent prompt-based skill suggestions, deeper MCP integrations (JIRA, Slack, GitHub), LSP for real-time code insights, and more—all while keeping the modular design.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why These Changes Matter
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Easier Adoption&lt;/strong&gt;: Install just one skill or the full suite—no need to copy everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always Up-to-Date&lt;/strong&gt;: Symlinks and reusable references keep your projects current.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: Mix and match components to fit your team's needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharing the Approach&lt;/strong&gt;: The marketplace plugin itself documents how others can apply this pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This evolution makes powerful AI-assisted workflows more approachable for everyday projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Add the marketplace:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   /plugin marketplace add aviadr1/claude-code-showcase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Explore and install:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   /plugin
   /plugin install &amp;lt;name&amp;gt;@claude-code-showcase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;For GitHub Actions: Use the example YAML above and add your Anthropic API key as a secret.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Full fork: &lt;a href="https://github.com/aviadr1/claude-code-showcase" rel="noopener noreferrer"&gt;https://github.com/aviadr1/claude-code-showcase&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Original (definitely worth starring!): &lt;a href="https://github.com/ChrisWiles/claude-code-showcase" rel="noopener noreferrer"&gt;https://github.com/ChrisWiles/claude-code-showcase&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A big thank you to &lt;a class="mentioned-user" href="https://dev.to/chriswiles"&gt;@chriswiles&lt;/a&gt; for the inspiring foundation and to @0x0SojalSec for introducing me to it. I'm enthusiastic about where AI tools like this are taking development—let's keep building great things together! 🚀&lt;/p&gt;




&lt;p&gt;Tags: #claude #anthropic #ai #productivity #githubactions #devtools #coding&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Modding Games: The Masochist's Magnum Opus, Where AI Meets Its Match</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Fri, 19 Dec 2025 22:00:16 +0000</pubDate>
      <link>https://dev.to/aviadr1/modding-games-the-masochists-magnum-opus-where-ai-meets-its-match-3odj</link>
      <guid>https://dev.to/aviadr1/modding-games-the-masochists-magnum-opus-where-ai-meets-its-match-3odj</guid>
      <description>&lt;p&gt;In PC gaming, modding isn't just a hobby—it's a rite of passage, a digital Sisyphus pushing boulders of code up endless hills of incompatibility. Picture legions of enthusiasts for Skyrim, Fallout, Minecraft, or any moddable title: they spend weeks curating load orders, only for a single misplaced patch to summon crashes from the void. It's not frustration; it's &lt;em&gt;culture&lt;/em&gt;. A proud masochism where tweaking trumps triumphing, and victory is measured not in playtime, but in stable saves. Forums pulse with arcane scrolls—Reddit's r/skyrimmods decoding conflict tabs like hieroglyphs, Nexus guides whispering "load UI last" as gospel. Modders aren't playing the game; the game plays &lt;em&gt;them&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The gauntlet is legendary. &lt;strong&gt;Load orders&lt;/strong&gt;: A top-to-bottom tyranny where later mods overwrite earlier ones, turning your launcher into a gladiatorial arena. One texture pack too high? Your dragon's scales turn into glitchy soup. &lt;strong&gt;Conflicts&lt;/strong&gt;: Invisible file wars, revealed only by tools like xEdit or LOOT, demanding hours of manual arbitration. &lt;strong&gt;Patches&lt;/strong&gt;: Brittle ceasefires between rivals, often outdated after a dev patch, vanishing into "removed for guidelines" purgatory. And the research? A Sisyphean odyssey through Steam's dynamic hellscape—flaky update dates, regional locks, "incompatible" flags on freshly-maintained mods. One wrong click, and your 100GB setup bricks; delete and rebuild, or perish. Yet modders thrive here, trading war stories on X: "One wrong mod and the game freaks out," or "Mod loaders make it &lt;em&gt;harder&lt;/em&gt;."&lt;/p&gt;

&lt;p&gt;Enter the AI savior—or so we hoped. Desperate for a merged modlist, I unleashed ChatGPT 5.2 in PRO mode on two sprawling setups. Equipped with browsing tools, it dove in: "Researching compatibility... verifying 4.2.4... Steam rate-limited... server disconnect... ACOT removed? UIOD patched?" Forty-one minutes of "Pro thinking" later—three retries, version whiplash (3.13 to 4.2.4), phantom removals, load order flip-flops—it surrendered. Never merged. Never stable. Even Reddit mocks such attempts: "Let AI handle load order? Issues incoming." X devs lament users blindly following AI "optimizations" that shatter games.&lt;/p&gt;

&lt;p&gt;This isn't mere failure; it's revelation. Modding embodies unstructured chaos: tribal docs scattered across wikis/forums, dependency graphs implied not explicit, platforms mutating (Steam's CSS-hidden "removals," dynamic JS walls). AI excels at patterns, but mod hell is folklore—unwritten rules from "load patches last" etched in comment sections. GPT looped because it couldn't grok the human element: modders' intuition from 50 crashes, the "just works" heuristics no dataset captures.&lt;/p&gt;

&lt;p&gt;Here's the thoughtful crux: Modding is the &lt;em&gt;true&lt;/em&gt; intelligence test. AGI arrives not with eloquent essays, but when it ingests a Workshop page, parses changelogs via regex sorcery, simulates overrides in a virtual engine, predicts conflicts from file hashes, and spits out a flawless load order—complete with "new game required" flags. Until then, AI remains a noob in the arena, while modders reign as wizards.&lt;/p&gt;

&lt;p&gt;Embrace it. Modding's joy &lt;em&gt;is&lt;/em&gt; the struggle—the god-complex of resurrection after CTDs. We mod for hours, play for minutes, crash, repeat. It's harder than the game, infinitely more rewarding. AI? Keep dreaming. The culture endures. 🚀💀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I Just Did 6 Weeks of Principal-Level Engineering in 3 Days</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Tue, 11 Nov 2025 01:10:57 +0000</pubDate>
      <link>https://dev.to/aviadr1/i-just-did-6-weeks-of-principal-level-engineering-in-3-days-3m75</link>
      <guid>https://dev.to/aviadr1/i-just-did-6-weeks-of-principal-level-engineering-in-3-days-3m75</guid>
      <description>&lt;p&gt;I just merged a pull request that Gemini estimated would take &lt;strong&gt;4 to 6 weeks&lt;/strong&gt; (&lt;strong&gt;20-27 person-days&lt;/strong&gt;) of "exceptionally high skill" (Principal Engineer level) work.&lt;/p&gt;

&lt;p&gt;It was a massive re-architecture of a core, stateful service. The final PR involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~&lt;strong&gt;4,000&lt;/strong&gt; lines for a new non-blocking scheduling strategy that maintains a budget of operations.&lt;/li&gt;
&lt;li&gt;~&lt;strong&gt;8,300&lt;/strong&gt; lines for a &lt;em&gt;complete&lt;/em&gt; test suite rewrite, including a new simulation harness.&lt;/li&gt;
&lt;li&gt;~&lt;strong&gt;3,500&lt;/strong&gt; lines of new engineering standards and design documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I delivered it in &lt;strong&gt;3 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's the most interesting part: I’m not sure I personally wrote a single line of code in that PR. And yet, it is very much my work.&lt;/p&gt;

&lt;p&gt;This is a case study in the new role of the AI-augmented engineer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The "Impossible" Bug
&lt;/h3&gt;

&lt;p&gt;We were tackling a challenging problem in a complex, stateful system—the kind of scheduling and resource-budgeting bug that is notoriously difficult to reproduce, let alone fix. You can't just run the service and "see" the bug; it only appears at scale, under specific, hard-to-predict race conditions.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;old&lt;/em&gt; way to solve this would have been weeks of logging, debugging, and writing brittle, implementation-specific tests, all while the production system continues to have issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  The New Workflow: Engineer as Director
&lt;/h3&gt;

&lt;p&gt;Instead of writing code, my role shifted entirely to that of an architect, director, and real-time reviewer. I was in a high-intensity "flow state" with my AI partner, Claude Code, over 123 commits.&lt;/p&gt;

&lt;p&gt;My job wasn't to type; it was to &lt;em&gt;demand&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I demanded architectural purity:&lt;/strong&gt; I tasked the AI with refactoring the entire system, decomposing it into pure functional components and a small, imperative core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I demanded testability:&lt;/strong&gt; I had the AI build an expressive simulation harness &lt;em&gt;before&lt;/em&gt; we even fixed the bug. This allowed us to &lt;em&gt;create&lt;/em&gt; the conditions of the bug in a local, deterministic test environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I demanded quality (The "Guardrails"):&lt;/strong&gt; I set up strict guardrails. I configured &lt;code&gt;pyright&lt;/code&gt; for static type checking and wrote &lt;code&gt;CLAUDE.md&lt;/code&gt; files with explicit instructions &lt;em&gt;for the AI&lt;/em&gt; on how to handle errors and write code that meets our standards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I demanded iteration:&lt;/strong&gt; As the simulation harness ran, I learned more about the problem. I had the AI redesign the scheduling and budgeting logic &lt;em&gt;twice&lt;/em&gt; as my understanding improved—something that would have been cost-prohibitive to do manually.&lt;/li&gt;
&lt;/ul&gt;
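&lt;p&gt;To make the "deterministic test environment" idea concrete, here is a minimal sketch of the kind of simulated clock such a harness could be built on. This is an illustrative toy, not the actual harness from the PR; all names here are invented:&lt;/p&gt;

```python
import heapq

class SimClock:
    """Deterministic event-driven clock: time advances only when we say so,
    so races become reproducible schedules instead of wall-clock luck."""
    def __init__(self):
        self.now = 0.0
        self._events = []  # (fire_time, seq, callback); seq breaks ties
        self._seq = 0

    def call_at(self, when, callback):
        heapq.heappush(self._events, (when, self._seq, callback))
        self._seq += 1

    def run_until(self, deadline):
        # Fire every event due at or before the deadline, in time order.
        while self._events and self._events[0][0] <= deadline:
            when, _, cb = heapq.heappop(self._events)
            self.now = when
            cb()
        self.now = deadline

# Two tasks scheduled at the same instant always fire in insertion
# order, every run: the "race" is now a fixed, replayable schedule.
order = []
clock = SimClock()
clock.call_at(1.0, lambda: order.append("A"))
clock.call_at(1.0, lambda: order.append("B"))
clock.run_until(2.0)
print(order)  # -> ['A', 'B'], deterministically
```

&lt;p&gt;Because time only advances inside &lt;code&gt;run_until&lt;/code&gt;, a bug that appears "sometimes, at scale" in production becomes a condition you can construct on demand in a local test.&lt;/p&gt;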

&lt;p&gt;My focus was on commit-level code review, reflecting on pitfalls, and continuously raising the bar for what I expected. The AI handled the generation, but I was the gatekeeper of quality and design.&lt;/p&gt;
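&lt;p&gt;The &lt;code&gt;pyright&lt;/code&gt; guardrail mentioned above is typically configured through a &lt;code&gt;pyrightconfig.json&lt;/code&gt; at the repository root; a strict setup might look like this (illustrative settings, not the project's actual configuration):&lt;/p&gt;

```json
{
  "typeCheckingMode": "strict",
  "reportMissingTypeStubs": true,
  "exclude": ["**/__pycache__", "**/node_modules"]
}
```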

&lt;h3&gt;
  
  
  The Result: Weeks to Days
&lt;/h3&gt;

&lt;p&gt;The AI-driven workflow didn't just 10x our speed; it fundamentally changed the &lt;em&gt;quality&lt;/em&gt; and &lt;em&gt;scope&lt;/em&gt; of the solution.&lt;/p&gt;

&lt;p&gt;A human engineer (even a team) would have spent 3-5 days just &lt;em&gt;designing&lt;/em&gt; the new test harness before writing a single line of code. We built the new architecture, the professional-grade simulation, &lt;em&gt;and&lt;/em&gt; fixed the bug in that same timeframe. We found and fixed bugs in our simulation that would have taken months to pin down in production.&lt;/p&gt;

&lt;p&gt;This is the power of this new partnership: it’s not just about writing code faster. It's about having an AI partner that can instantly scaffold an entire high-quality architecture, a complete test suite, and all the documentation, allowing you—the human expert—to operate purely at the level of design and strategy.&lt;/p&gt;

&lt;p&gt;You just have to know what to ask for.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>Self-Improving AI: One Prompt That Makes Claude Learn From Every Mistake</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Fri, 07 Nov 2025 23:41:38 +0000</pubDate>
      <link>https://dev.to/aviadr1/self-improving-ai-one-prompt-that-makes-claude-learn-from-every-mistake-16ek</link>
      <guid>https://dev.to/aviadr1/self-improving-ai-one-prompt-that-makes-claude-learn-from-every-mistake-16ek</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I'm going to share a magical prompt and a beautiful structure for &lt;code&gt;CLAUDE.md&lt;/code&gt; that makes Claude Code better every time it makes a mistake.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Untapped Potential
&lt;/h2&gt;

&lt;p&gt;Most repositories have a &lt;code&gt;CLAUDE.md&lt;/code&gt; file—it's standard practice. But here's what we're missing: we have thousands of tokens of cognition at the start of every session. Why treat CLAUDE.md like static documentation when we could turn it into a &lt;strong&gt;self-improving system&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;The breakthrough is making it trivially easy for the AI to improve itself continuously, using that abundant reasoning capacity to compound learning over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/aviadr1/claude-meta/blob/main/CLAUDE_TEMPLATE.md" rel="noopener noreferrer"&gt;Starter &lt;code&gt;CLAUDE.md&lt;/code&gt; template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aviadr1/claude-meta/blob/main/CLAUDE_FULL.md" rel="noopener noreferrer"&gt;Full production example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Human and AI Cognition Should Focus
&lt;/h2&gt;

&lt;p&gt;Here's the key insight about working with AI. Human cognition is best spent on critical thinking: spotting mistakes, preventing patterns from repeating, laying out clearly what we're trying to achieve and why, and setting guardrails to ensure the work is done correctly.&lt;/p&gt;

&lt;p&gt;AI cognition, relative to human time and cognitive load, is best used for executing on well-defined instructions, analyzing patterns from recent context, writing structured documentation, and maintaining consistency across large codebases.&lt;/p&gt;

&lt;p&gt;As fast-paced engineers, our cognitive load is high and our time is limited. Letting a very enthusiastic Claude execute within our guardrails, and improve itself through the system we provide, gives the work compounding benefits over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Innovation: Two Simple Ideas
&lt;/h2&gt;

&lt;p&gt;What makes this work is the combination of two deceptively simple ideas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, a structure that teaches the AI how to teach itself. Most CLAUDE.md files just list rules. Mine includes meta-rules about how to write rules. This means when Claude adds new content, it automatically maintains quality and consistency. The document doesn't just grow—it grows well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, a single prompt that transforms every mistake into permanent learning. When Claude makes a mistake, instead of just fixing it and moving on, I use one sentence that triggers an entire self-improvement cycle. Claude reflects on what went wrong, abstracts the general pattern, and writes it down following the meta-rules it just read.&lt;/p&gt;

&lt;p&gt;Let me show you how each piece works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Rules: Teaching AI How to Learn
&lt;/h2&gt;

&lt;p&gt;The real innovation is in the META section of CLAUDE.md. This section exists for one purpose: to teach Claude how to write good rules when it makes mistakes.&lt;/p&gt;

&lt;p&gt;Here's what mine looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## META - MAINTAINING THIS DOCUMENT&lt;/span&gt;

&lt;span class="gu"&gt;### Writing Effective Guidelines&lt;/span&gt;

When adding new rules to this document, follow these principles:

&lt;span class="gs"&gt;**Core Principles (Always Apply):**&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Use absolute directives - Start with "NEVER" or "ALWAYS"
&lt;span class="p"&gt;2.&lt;/span&gt; Lead with why - Explain the problem before the solution (1-3 bullets max)
&lt;span class="p"&gt;3.&lt;/span&gt; Be concrete - Include actual commands/code
&lt;span class="p"&gt;4.&lt;/span&gt; Minimize examples - One clear point per code block
&lt;span class="p"&gt;5.&lt;/span&gt; Bullets over paragraphs - Keep explanations concise

&lt;span class="gs"&gt;**Optional Enhancements (Use Strategically):**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ❌/✅ examples: Only when the antipattern is subtle
&lt;span class="p"&gt;-&lt;/span&gt; "Warning Signs" section: Only for gradual mistakes
&lt;span class="p"&gt;-&lt;/span&gt; "General Principle": Only when abstraction is non-obvious

&lt;span class="gs"&gt;**Anti-Bloat Rules:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ❌ Don't add "Warning Signs" to obvious rules
&lt;span class="p"&gt;-&lt;/span&gt; ❌ Don't show bad examples for trivial mistakes
&lt;span class="p"&gt;-&lt;/span&gt; ❌ Don't write paragraphs explaining what bullets can convey
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think about what this does. Every time Claude reads CLAUDE.md at the start of a session, it learns not just your project's rules but how to write new rules. When it makes a mistake later in that session and you trigger the reflection prompt, Claude already knows the format to use, when to add detail versus keep it brief, and how to avoid bloat.&lt;/p&gt;

&lt;p&gt;This is the key that unlocks continuous self-improvement. Without meta-rules, Claude would add verbose, inconsistent content that degrades the document over time. With meta-rules, Claude self-regulates. It asks itself "Should I add a 'Warning Signs' section here?" and checks the meta-rules to decide. The quality of what it writes compounds rather than degrades.&lt;/p&gt;

&lt;p&gt;The meta-rules also include a simple instruction that whenever Claude adds a new rule to the detailed sections, it must update the summary section at the top. This creates a two-tier structure where Claude can quickly scan absolute rules at session start, then reference detailed sections while writing code. Adding a new rule becomes frictionless—one line in the summary, one detailed section following the meta-rules.&lt;/p&gt;
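&lt;p&gt;To make the two-tier structure concrete, here is a minimal sketch of its shape (the rules shown are placeholders, not the actual document):&lt;/p&gt;

```markdown
## SUMMARY - ABSOLUTE RULES
<!-- one line per rule; updated whenever a detailed section is added -->
- NEVER patch widely-used infrastructure in tests
- ALWAYS keep imports at the top of files

## DETAILED GUIDELINES

### NEVER patch widely-used infrastructure in tests
**Why:**
- Patching shared components (loggers, clients) hides real failures

**Instead:** inject a test double at the boundary the test owns.

## META - MAINTAINING THIS DOCUMENT
(the meta-rules shown earlier)
```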

&lt;h2&gt;
  
  
  The Magic Prompt: One Sentence Changes Everything
&lt;/h2&gt;

&lt;p&gt;When Claude makes a mistake, after correcting it, I use this prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Reflect on this mistake. Abstract and generalize the learning. Write it to CLAUDE.md."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it. One sentence. But look at what happens when Claude processes this instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflect&lt;/strong&gt; tells Claude to analyze what went wrong and why, not just acknowledge the correction. Claude has perfect context—the mistake is right there in working memory with all the surrounding code. This reflection captures nuances that would be lost if you tried to document it manually later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt; tells Claude to extract the general pattern from the specific instance. If Claude patched a logger, the abstraction isn't "don't patch logger" but "don't patch widely-used infrastructure." This is where the AI's pattern recognition shines—it can see the underlying principle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generalize&lt;/strong&gt; tells Claude to create a reusable decision framework. Not just a rule, but guidance on how to think about similar situations in the future. "When you see X, ask yourself Y."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write it to CLAUDE.md&lt;/strong&gt; triggers Claude to follow all the meta-rules it read at session start. Use NEVER or ALWAYS. Lead with why. Keep it concise. Update the summary. All of this happens automatically because the meta-rules have already set the guardrails.&lt;/p&gt;

&lt;p&gt;You've automated an entire learning cycle with one sentence. Claude does the execution work—analyzing, abstracting, documenting, maintaining format. You did the critical thinking—spotting that a pattern exists worth capturing.&lt;/p&gt;
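&lt;p&gt;For example, taking the logger case above, the reflection prompt might produce an entry like this (a hypothetical rule, written to follow the meta-rules):&lt;/p&gt;

```markdown
### NEVER patch widely-used infrastructure

**Why:**
- Patching a shared component (logger, HTTP client) silences failures everywhere it is used
- Tests pass while the real integration stays broken

**Instead:** inject a narrow test double at the seam the test actually owns.

**General Principle:** the more call sites a component has, the less you should mock it.
```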

&lt;h2&gt;
  
  
  Why This Creates Compounding Improvement
&lt;/h2&gt;

&lt;p&gt;The magic is in the compounding loop this creates. Session one, Claude makes three mistakes. You use the prompt three times. Three new rules get added to CLAUDE.md. Five seconds of your time per rule.&lt;/p&gt;

&lt;p&gt;Session two, Claude reads those rules at startup. It doesn't make those three mistakes anymore. Instead it makes new, more sophisticated mistakes. You capture those. Five seconds per rule again.&lt;/p&gt;

&lt;p&gt;Session three, Claude reads all the accumulated rules. The basic mistakes have vanished. Now you're having discussions about architectural trade-offs instead of fighting about whether imports go at the top of files.&lt;/p&gt;

&lt;p&gt;The mistakes evolve upward. This is exactly what you want from a learning system. You're not eliminating mistakes—you're elevating the conversation to increasingly sophisticated levels.&lt;/p&gt;

&lt;p&gt;And here's what makes it sustainable: the meta-rules ensure that as the document grows, quality doesn't degrade. Claude self-regulates based on the guidelines you set once. The document maintains consistency automatically. You're not manually editing every addition—Claude enforces its own standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deep Insight: Automation Within Automation
&lt;/h2&gt;

&lt;p&gt;What makes this work at a fundamental level is understanding the economics of human versus AI cognition. Your time and cognitive load are the scarce resources. AI execution capacity is abundant relative to that.&lt;/p&gt;

&lt;p&gt;Traditional documentation is expensive because it consumes scarce human time. Writing it is expensive. Maintaining it is expensive. Keeping it consistent is expensive. So documentation often doesn't happen or becomes stale.&lt;/p&gt;

&lt;p&gt;But with AI, execution is abundant. Claude is going to read CLAUDE.md at every session start anyway—that cognition is already being spent. Claude can analyze patterns in milliseconds. Claude can write structured text faster than you can read it. So why not put that abundant capacity to work?&lt;/p&gt;

&lt;p&gt;The reflection prompt does exactly this. You spend five seconds providing the critical thinking—spotting that a pattern exists worth capturing. Claude spends its abundant execution capacity doing the analysis, abstraction, and documentation. The marginal cost of improvement drops to nearly zero, so improvement happens constantly.&lt;/p&gt;

&lt;p&gt;You're creating automation within automation. The AI uses its own reasoning to make itself better at reasoning about your code. And because you've set clear guardrails with meta-rules, the quality compounds rather than degrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: Two Templates
&lt;/h2&gt;

&lt;p&gt;I'm providing two files to help you implement this system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE_TEMPLATE.md&lt;/strong&gt; is a minimal starting point with the two-tier structure, the essential meta-rules, and a few universally useful rules like keeping imports at the top of files and avoiding magic numbers in tests. Use this if you're starting fresh or want a clean foundation to customize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE_FULL.md&lt;/strong&gt; is our complete CLAUDE.md showing what the system looks like after months of evolution. It includes project-specific guidelines, examples of well-written rules at various sophistication levels, and demonstrates how the meta-rules maintain quality as the document grows. Use this if you want to see a fully-evolved example or work in a similar tech stack.&lt;/p&gt;

&lt;p&gt;Both files demonstrate the structure and show you what a self-improving CLAUDE.md looks like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Really Is
&lt;/h2&gt;

&lt;p&gt;This pattern reveals something profound about working with AI. Current AI has three fundamental limitations when coding: no project memory across sessions, no learning mechanism from corrections, and expensive fine-tuning requirements.&lt;/p&gt;

&lt;p&gt;This approach solves all three using only a markdown file, a structured format, and one magic prompt. CLAUDE.md provides memory—it's read every session. The reflection prompt provides learning—mistakes become permanent lessons. The meta-rules provide quality control—the AI maintains its own standards.&lt;/p&gt;

&lt;p&gt;You're creating a crude but effective form of continuous learning. No special tools. No API access. No fine-tuning costs. Just smart allocation of human critical thinking and AI execution capacity.&lt;/p&gt;

&lt;p&gt;The deeper pattern is about self-reinforcing loops. The best workflows with AI aren't linear sequences—they're loops where the output feeds back to improve the input. Every mistake makes the system smarter. Every correction improves future sessions. Every rule makes the next rule easier to learn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Division of Labor
&lt;/h2&gt;

&lt;p&gt;The breakthrough isn't having a CLAUDE.md file—everyone has that. The breakthrough is making it trivially easy for the AI to improve itself continuously while you focus on critical thinking.&lt;/p&gt;

&lt;p&gt;You provide the critical thinking: spotting patterns, preventing problems, setting guardrails, deciding what matters. Claude provides the execution: analyzing mistakes, extracting principles, writing documentation, maintaining consistency.&lt;/p&gt;

&lt;p&gt;The meta-rules ensure quality compounds as the document grows. The reflection prompt automates the entire improvement cycle. Together they create a system where every mistake becomes permanent learning with minimal human effort.&lt;/p&gt;

&lt;p&gt;You're not just documenting standards. You're building a system that teaches itself while you focus on what humans do best: the critical thinking that spots patterns and sets direction.&lt;/p&gt;

&lt;p&gt;The magic words are simple: &lt;strong&gt;"Reflect on this mistake. Abstract and generalize the learning. Write it to CLAUDE.md."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the prompt that turns abundant AI execution into continuous self-improvement, with your critical thinking as the guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/aviadr1/claude-meta/blob/main/CLAUDE_TEMPLATE.md" rel="noopener noreferrer"&gt;Starter &lt;code&gt;CLAUDE.md&lt;/code&gt; template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aviadr1/claude-meta/blob/main/CLAUDE_FULL.md" rel="noopener noreferrer"&gt;Full production example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aviadr1/claude-meta" rel="noopener noreferrer"&gt;GitHub for this research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>softwareengineering</category>
      <category>documentation</category>
    </item>
    <item>
      <title>Vibe coding: What we learned from flip-flopping 8 times on a simple formula</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Thu, 06 Nov 2025 21:03:57 +0000</pubDate>
      <link>https://dev.to/aviadr1/vibe-coding-what-we-learned-from-flip-flopping-8-times-on-a-simple-formula-316</link>
      <guid>https://dev.to/aviadr1/vibe-coding-what-we-learned-from-flip-flopping-8-times-on-a-simple-formula-316</guid>
      <description>&lt;h1&gt;
  
  
  The Budget Calculator Paradox: When Tests Don't Match Reality
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Part 6 of the Multi-Agent Development Series&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Can 5 Claude Code Agents Work Independently?&lt;/li&gt;
&lt;li&gt;Part 2: The Reality of "Autonomous" Multi-Agent Development&lt;/li&gt;
&lt;li&gt;Part 3: Property-Based Testing with Hypothesis&lt;/li&gt;
&lt;li&gt;Part 4: Zero-Conflict Architecture&lt;/li&gt;
&lt;li&gt;Part 5: Communication Protocols for AI Agents&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Create a budget calculator to determine minimum checks/minute needed for fair participant checking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Agent flip-flopped 8 times on the formula, changing tier ordering and capacity calculations reactively based on test failures instead of proactively based on requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User intervention (3 times):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"you keep changing the policy without reflecting on this constant changes"&lt;br&gt;
"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"&lt;br&gt;
"make the calculator CORRECT and then provide it some extra margin"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lessons:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build the calculator FIRST&lt;/strong&gt;, then use it to validate test expectations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't react to test failures&lt;/strong&gt; by changing the policy - ask "is my expectation realistic?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account for reality&lt;/strong&gt;: Cycle quantization, integer rounding, margins for variance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate concerns&lt;/strong&gt;: Policy (what should happen) vs Budget (what's needed for it to work)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The correct pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Define policy requirements (tier ordering, deadlines)
2. Build budget calculator based on requirements
3. Add safety margin (1.2x - 1.5x minimum)
4. Use calculator in tests to set expectations
5. If tests fail, find and fix the bug - don't change the calculator or the policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
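&lt;p&gt;Steps 2-4 of that pattern can be sketched as a small Python function. This is an illustrative reconstruction, not the series' actual calculator; the names, the 1.3x margin, and the fixed cycle length are assumptions (the 5s cycle matches the examples later in this post):&lt;/p&gt;

```python
import math

CYCLE_INTERVAL_S = 5  # scheduler runs in discrete 5-second cycles (assumed)

def min_budget_checks_per_minute(num_participants: int,
                                 deadline_s: float,
                                 margin: float = 1.3) -> int:
    """Minimum checks/minute to cover all participants before the deadline,
    accounting for cycle quantization, integer rounding, and a safety margin."""
    # Only whole cycles fit before the deadline.
    cycles_before_deadline = max(1, math.floor(deadline_s / CYCLE_INTERVAL_S))
    # Whole checks per cycle, rounded up so capacity is never short.
    checks_per_cycle = math.ceil(num_participants / cycles_before_deadline)
    checks_per_minute = checks_per_cycle * (60 / CYCLE_INTERVAL_S)
    # Safety margin for variance; rounded up to an integer budget.
    return math.ceil(checks_per_minute * margin)

print(min_budget_checks_per_minute(10, 20))  # -> 47
```

&lt;p&gt;For 10 participants and a 20s deadline this yields 36 checks/min before margin (versus the naive 30), because only whole cycles and whole checks fit inside the deadline; the margin then lifts it to 47.&lt;/p&gt;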






&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Video moderation system needs to check participants at different rates based on risk and staleness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core question:&lt;/strong&gt; How many checks/minute budget is needed to ensure fairness?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naive answer:&lt;/strong&gt; &lt;code&gt;num_participants / recheck_interval&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: 10 participants, 60s recheck → 10 checks/min&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt; Way more complicated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 1: The Optimistic Formula
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent's first attempt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;critical_deadline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate minimum budget needed to check all participants before critical deadline.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Continuous time formula
&lt;/span&gt;    &lt;span class="n"&gt;time_to_check_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;critical_deadline&lt;/span&gt;  &lt;span class="c1"&gt;# seconds
&lt;/span&gt;    &lt;span class="n"&gt;checks_per_second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_participants&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_to_check_all&lt;/span&gt;
    &lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;checks_per_second&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 participants, critical_deadline=20s&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_to_check_all = 20s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checks_per_minute = (10 / 20) * 60 = 30 checks/min&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test using this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fairness_with_sufficient_budget&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;participants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;make_participant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;critical_deadline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# budget = 30 checks/min
&lt;/span&gt;
    &lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Expected: All 10 participants checked within 20s
&lt;/span&gt;    &lt;span class="n"&gt;first_cycle_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# First 20s (4 cycles × 5s)
&lt;/span&gt;    &lt;span class="n"&gt;all_checked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cycle&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;first_cycle_checks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;all_checked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cycle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checked_users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_checked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All participants should be checked within 20s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ❌ Test failed! Only 7/10 participants checked in 20s.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why It Failed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Cycle Quantization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Formula assumed continuous time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;checks_per_second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reality uses discrete 5s cycles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cycle_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# seconds
&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Integer truncation!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual budget:&lt;/strong&gt; 2 checks per 5s cycle = 24 checks/min (not 30)&lt;/p&gt;
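&lt;p&gt;The truncation loss is easy to reproduce. The helper below is illustrative only (&lt;code&gt;effective_per_minute&lt;/code&gt; is not part of the actual system); it assumes 5s cycles and per-cycle &lt;code&gt;int()&lt;/code&gt; allocation, as described above:&lt;/p&gt;

```python
# How much of a requested per-minute budget survives per-cycle truncation?
# Illustrative sketch: assumes 5s cycles and int() allocation each cycle.
def effective_per_minute(budget_per_minute, cycle_s=5):
    per_cycle = int(budget_per_minute / 60 * cycle_s)  # truncated every cycle
    return per_cycle * (60 // cycle_s)                 # 12 cycles per minute

for requested in (10, 20, 30):
    print(requested, "->", effective_per_minute(requested))
# 10 -> 0, 20 -> 12, 30 -> 24
```

&lt;p&gt;A requested 30 checks/min collapses to 24, and smaller budgets fare even worse, which is exactly the waste the later iterations run into.&lt;/p&gt;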




&lt;p&gt;&lt;strong&gt;Problem 2: Tier Ordering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Formula assumed all participants get checked sequentially. Reality uses tier prioritization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tier 1 (at deadline): Gets checked first&lt;/li&gt;
&lt;li&gt;Tier 2 (fairness/never moderated): Checked only if budget remains after Tier 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With budget=2 per cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cycle 1: Check 2 never-moderated participants → they move into Tier 1 (at deadline)&lt;/li&gt;
&lt;li&gt;Later cycles: those Tier 1 participants hit their recheck deadline and outrank everyone still waiting&lt;/li&gt;
&lt;li&gt;Each recheck consumes budget that first-time checks never see, while Tier 1 keeps growing&lt;/li&gt;
&lt;li&gt;Result: Tier 1 monopolizes the budget; never-moderated participants starve&lt;/li&gt;
&lt;/ul&gt;
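&lt;p&gt;The starvation dynamic can be sketched with a toy scheduler (a hypothetical model for illustration; the function and parameter names are invented, not taken from the real system):&lt;/p&gt;

```python
# Toy staleness-first scheduler: Tier 1 (at recheck deadline) always
# outranks Tier 2 (never moderated). Hypothetical model for illustration.
def simulate(num_participants=10, checks_per_cycle=2,
             cycle_s=5, recheck_deadline_s=20, duration_s=60):
    last_checked = {f"p{i}": None for i in range(num_participants)}
    for now in range(0, duration_s, cycle_s):
        # Tier 1: previously checked and now at/past their recheck deadline
        tier1 = [p for p, t in last_checked.items()
                 if t is not None and now - t >= recheck_deadline_s]
        # Tier 2: never moderated (fairness)
        tier2 = [p for p, t in last_checked.items() if t is None]
        for p in (tier1 + tier2)[:checks_per_cycle]:
            last_checked[p] = now
    # How many distinct participants were ever checked?
    return sum(t is not None for t in last_checked.values())

print(simulate())  # 8 -> p8 and p9 are never checked, even over a full minute
```

&lt;p&gt;Under this model, once the first participants hit their recheck deadline, rechecks permanently consume the entire budget, so the last two participants never receive a first check.&lt;/p&gt;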




&lt;h2&gt;
  
  
  Iteration 2: The Reactive Fix (First Flip-Flop)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent's reaction:&lt;/strong&gt; "Tests fail because tier ordering is wrong!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change 1: Flip tier priority&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before:
&lt;/span&gt;&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;At&lt;/span&gt; &lt;span class="nf"&gt;deadline &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleness&lt;/span&gt; &lt;span class="n"&gt;prevention&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Never&lt;/span&gt; &lt;span class="nf"&gt;moderated &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fairness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After:
&lt;/span&gt;&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Never&lt;/span&gt; &lt;span class="nf"&gt;moderated &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fairness&lt;/span&gt; &lt;span class="n"&gt;FIRST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;At&lt;/span&gt; &lt;span class="nf"&gt;deadline &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleness&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rationale:&lt;/strong&gt; "Fairness should come before staleness prevention"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# Same budget
&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Now fairness-first ordering
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ✅ Test passed! All 10 checked within 20s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But new test failed:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_staleness_prevention&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Participants at deadline should be checked before going stale&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# This now fails because fairness beats staleness!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  User Intervention #1
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"you keep changing the policy without reflecting on this constant changes"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Are you changing the policy based on requirements, or reactively based on test failures?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent realized:&lt;/strong&gt; Policy should be based on requirements (what SHOULD happen), not test outcomes (making tests pass).&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 3: Back to Staleness-First (Second Flip-Flop)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent reconsidered:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staleness prevention is more critical (safety requirement)&lt;/li&gt;
&lt;li&gt;Fairness is important but secondary&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Revert tier ordering back to staleness-first&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But now the fairness test fails again!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent's realization:&lt;/strong&gt; "The budget formula is wrong, not the tier ordering."&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 4: Account for Tier Monopolization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; With staleness-first, once a participant is checked, they stay in Tier 1 (at deadline) until rechecked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must check ALL N participants BEFORE the first one needs rechecking&lt;/li&gt;
&lt;li&gt;Not before the "critical deadline" (20s), but before the "recheck interval" (60s)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Updated formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate minimum budget to check all participants before first recheck.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cycle_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# seconds
&lt;/span&gt;    &lt;span class="n"&gt;num_cycles_available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;  &lt;span class="c1"&gt;# 60/5 = 12 cycles
&lt;/span&gt;
    &lt;span class="n"&gt;checks_per_cycle_needed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_participants&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_cycles_available&lt;/span&gt;  &lt;span class="c1"&gt;# 10/12 = 0.83
&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to checks/minute and account for integer truncation
&lt;/span&gt;    &lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle_needed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
                      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.83&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
                      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 participants, 60s recheck&lt;/li&gt;
&lt;li&gt;Need to check all 10 before first recheck (60s)&lt;/li&gt;
&lt;li&gt;12 cycles available (60s / 5s)&lt;/li&gt;
&lt;li&gt;Need 0.83 checks/cycle → 10 checks/min&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# budget = 10 checks/min
&lt;/span&gt;
&lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Expected: All 10 checked within 60s (before first recheck)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ❌ Still failed! Only 8/10 checked in 60s.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why It Still Failed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integer truncation strikes again:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.83&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0 checks per cycle!&lt;/strong&gt; Budget completely wasted by integer truncation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 5: Make the Formula TIGHTER (Wrong Direction!)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent's reaction:&lt;/strong&gt; "Budget of 10/min should be sufficient but tests fail. Formula must be too loose!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;critical_deadline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# TIGHTER deadline: 15s instead of 60s
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;  &lt;span class="c1"&gt;# Much higher!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  User Intervention #2
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; When budget "should be sufficient" but tests fail, agent made the formula STRICTER (requiring more budget) instead of adding MARGIN (accounting for waste).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; If 10/min "should work" but doesn't, is the formula wrong OR does it need margin for integer truncation?&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 6: Add Margin for Reality
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent's realization:&lt;/strong&gt; Formula calculates theoretical minimum. Reality has waste (integer truncation, tier switching overhead).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Add margin&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate minimum budget with safety margin.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cycle_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;num_cycles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;

    &lt;span class="c1"&gt;# Theoretical minimum
&lt;/span&gt;    &lt;span class="n"&gt;min_checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_participants&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_cycles&lt;/span&gt;
    &lt;span class="n"&gt;min_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="c1"&gt;# Add margin for integer truncation and overhead
&lt;/span&gt;    &lt;span class="n"&gt;recommended_budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_budget&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 participants, 60s recheck&lt;/li&gt;
&lt;li&gt;Minimum: 10 checks/min&lt;/li&gt;
&lt;li&gt;With 1.5x margin: 15 checks/min&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# budget = 15 checks/min
&lt;/span&gt;
&lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ✅ Test passed! All 10 participants checked within 60s.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 7: Rolling Accumulator for Fractional Checks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;But another problem surfaced:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; budget=20 checks/min (fractional: 1.67 checks/cycle)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.67&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lost capacity:&lt;/strong&gt; 0.67 checks/cycle wasted!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provided:&lt;/strong&gt; 20 checks/min&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actually used:&lt;/strong&gt; 12 checks/min (1 per cycle × 12 cycles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted:&lt;/strong&gt; 40% of budget!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution: Rolling accumulator&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BudgetAccumulator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accumulated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_checks_this_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Add fractional checks to accumulator
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accumulated&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;

        &lt;span class="c1"&gt;# Return integer checks available
&lt;/span&gt;        &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accumulated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accumulated&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;budget=20 checks/min = 1.67 checks per 5s cycle&lt;/li&gt;
&lt;li&gt;Cycle 1: accumulated=0 + 1.67 = 1.67 → return 1, accumulated=0.67&lt;/li&gt;
&lt;li&gt;Cycle 2: accumulated=0.67 + 1.67 = 2.34 → return 2, accumulated=0.34&lt;/li&gt;
&lt;li&gt;Cycle 3: accumulated=0.34 + 1.67 = 2.01 → return 2, accumulated=0.01&lt;/li&gt;
&lt;li&gt;Pattern: 1, 2, 2, 1, 2, 2, ... → Averages to 1.67 checks/cycle ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Budget utilized accurately, no waste from integer truncation.&lt;/p&gt;
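&lt;p&gt;A quick self-contained comparison (reusing &lt;code&gt;BudgetAccumulator&lt;/code&gt; exactly as defined above) shows the recovered capacity:&lt;/p&gt;

```python
class BudgetAccumulator:
    """Carries fractional per-cycle budget forward instead of discarding it."""
    def __init__(self, checks_per_minute):
        self.checks_per_minute = checks_per_minute
        self.accumulated = 0.0

    def get_checks_this_cycle(self, cycle_interval=5):
        # Add fractional checks, then hand out only the integer part
        self.accumulated += (self.checks_per_minute / 60) * cycle_interval
        available = int(self.accumulated)
        self.accumulated -= available
        return available

acc = BudgetAccumulator(checks_per_minute=20)
naive_total = sum(int(20 / 60 * 5) for _ in range(12))             # truncate each cycle
accum_total = sum(acc.get_checks_this_cycle() for _ in range(12))  # carry the remainder

print(naive_total)  # 12 (40% of the budget lost)
print(accum_total)  # 19 or 20 depending on float rounding, close to the nominal 20
```

&lt;p&gt;The naive per-cycle truncation delivers 12 of the 20 requested checks per minute; the accumulator delivers essentially all of them.&lt;/p&gt;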




&lt;h2&gt;
  
  
  Iteration 8: Build Calculator FIRST, Use It in Tests
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Final insight:&lt;/strong&gt; Agent was writing tests with ad-hoc budget assumptions, then building the calculator to justify them. &lt;strong&gt;Backwards!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correct workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define policy requirements&lt;/strong&gt; (tier ordering, recheck intervals)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build budget calculator&lt;/strong&gt; based on policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add safety margin&lt;/strong&gt; (1.2x - 1.5x)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use calculator in tests&lt;/strong&gt; to set expectations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_budget_allocation.py
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fairness_with_sufficient_budget&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;participants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;make_participant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="c1"&gt;# USE THE CALCULATOR to determine budget
&lt;/span&gt;    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;  &lt;span class="c1"&gt;# Safety margin
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# budget = 15 checks/min
&lt;/span&gt;
    &lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Now expectation is realistic (based on calculator)
&lt;/span&gt;    &lt;span class="n"&gt;all_checked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_all_checked_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_checked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key change:&lt;/strong&gt; Tests don't assume arbitrary budgets. They CALCULATE the needed budget, then verify it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lessons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lesson 1: Build Calculator First, Use It Everywhere
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Anti-pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Test makes up budget
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fairness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="c1"&gt;# Seems reasonable?
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;all_checked&lt;/span&gt;  &lt;span class="c1"&gt;# Fails!
&lt;/span&gt;
&lt;span class="c1"&gt;# Agent fixes by adjusting budget
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fairness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# Try higher?
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;all_checked&lt;/span&gt;  &lt;span class="c1"&gt;# Still fails!
&lt;/span&gt;
&lt;span class="c1"&gt;# Repeat until tests pass...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build calculator based on requirements
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Account for cycle quantization
&lt;/span&gt;    &lt;span class="c1"&gt;# Account for tier ordering
&lt;/span&gt;    &lt;span class="c1"&gt;# Add safety margin
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_budget&lt;/span&gt;

&lt;span class="c1"&gt;# Use calculator in tests
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fairness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;all_checked&lt;/span&gt;  &lt;span class="c1"&gt;# Passes because budget is correct!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The calculator is the single source of truth.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Lesson 2: Don't React to Test Failures by Changing Policy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Anti-pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test fails → Change tier ordering
Test fails → Change deadline
Test fails → Change formula
→ Flip-flopping, no stability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test fails → Ask: "Is my expectation realistic?"
            → Debug: What's the actual behavior?
            → Understand: Why does it differ?
            → Fix: The bug OR the test expectation (not the policy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Policy should be based on requirements, not test outcomes.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Lesson 3: Account for Reality (Cycle Quantization)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Continuous time formulas are optimistic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Theory:
&lt;/span&gt;&lt;span class="n"&gt;checks_per_second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="c1"&gt;# Reality (5s cycles):
&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;actual_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;  &lt;span class="c1"&gt;# Not 30!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Formula must account for discrete cycles:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks_per_minute&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cycle_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;

    &lt;span class="c1"&gt;# Account for integer truncation
&lt;/span&gt;    &lt;span class="n"&gt;actual_checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;actual_budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual_budget&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;checks_per_minute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Warn about quantization loss
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warning: Requested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;checks_per_minute&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/min, actual &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_budget&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;actual_budget&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Lesson 4: Separate Concerns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Three distinct concerns got mixed up:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Policy (WHAT should happen)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Business logic
&lt;/span&gt;&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Staleness&lt;/span&gt; &lt;span class="nf"&gt;prevention &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Fairness &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;never&lt;/span&gt; &lt;span class="n"&gt;moderated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Critical&lt;/span&gt; &lt;span class="nf"&gt;urgency &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;approaching&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Budget (HOW MUCH is needed for policy to work)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Capacity planning
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Given policy, how much budget needed?
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;minimum_budget&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Tests (VERIFY policy works with given budget)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Validation
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_policy_works&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# Use calculator
&lt;/span&gt;    &lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;policy_invariants_hold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Verify
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When test fails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't change Policy (unless requirements changed)&lt;/li&gt;
&lt;li&gt;Don't change Budget formula (unless calculation wrong)&lt;/li&gt;
&lt;li&gt;DO debug: Is there a bug in implementation?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Correct Formula (Final Version)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_participants&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recheck_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cycle_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Calculate minimum checks/minute needed to ensure all participants
    are checked before first recheck (fairness requirement).

    Args:
        num_participants: Number of participants to check
        recheck_interval_seconds: Time before participant needs recheck (default 60s)
        cycle_interval_seconds: System cycle interval (default 5s)
        margin: Safety margin to account for quantization and overhead (default 1.5x)

    Returns:
        Recommended checks per minute (float)

    Example:
&lt;/span&gt;&lt;span class="gp"&gt;        &amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="mf"&gt;15.0&lt;/span&gt;  &lt;span class="c1"&gt;# 10 participants need 10 checks/min minimum, 15 with 1.5x margin
&lt;/span&gt;    &lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# How many cycles available before first recheck?
&lt;/span&gt;    &lt;span class="n"&gt;num_cycles_available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recheck_interval_seconds&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval_seconds&lt;/span&gt;

    &lt;span class="c1"&gt;# How many checks per cycle needed?
&lt;/span&gt;    &lt;span class="n"&gt;checks_per_cycle_needed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_participants&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_cycles_available&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to checks per minute
&lt;/span&gt;    &lt;span class="n"&gt;min_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle_needed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="c1"&gt;# Add safety margin for:
&lt;/span&gt;    &lt;span class="c1"&gt;# - Integer truncation in checks_per_cycle
&lt;/span&gt;    &lt;span class="c1"&gt;# - Tier switching overhead
&lt;/span&gt;    &lt;span class="c1"&gt;# - Variance in participant arrival times
&lt;/span&gt;    &lt;span class="n"&gt;recommended_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min_checks_per_minute&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_checks_per_minute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For 10 participants needing 60s recheck:
&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: 15.0 checks/min
&lt;/span&gt;
&lt;span class="c1"&gt;# For 50 participants needing 30s recheck:
&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_budget_for_fairness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recheck_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: 120.0 checks/min (50 participants × 2 checks/min × 1.2 margin)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case 1: API Rate Limiting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; How many requests/second needed to process N items within deadline D?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_rps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deadline_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;min_rps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_items&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;deadline_seconds&lt;/span&gt;
    &lt;span class="n"&gt;recommended_rps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min_rps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_rps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;But account for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch size quantization (API processes in batches of 100)&lt;/li&gt;
&lt;li&gt;Network latency overhead&lt;/li&gt;
&lt;li&gt;Retry margin&lt;/li&gt;
&lt;li&gt;Concurrent request limits&lt;/li&gt;
&lt;/ul&gt;
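As a sketch of how those adjustments fold into the formula (the batch size of 100 and the helper name here are illustrative assumptions, not a real API's numbers), batch quantization rounds the workload up to whole batches before computing the rate:

```python
import math

def calculate_min_rps_batched(num_items, deadline_seconds,
                              batch_size=100, margin=1.5):
    """Minimum requests/sec when the API only processes whole batches."""
    # Batch quantization: a partial batch still costs a full batch
    num_batches = math.ceil(num_items / batch_size)
    effective_items = num_batches * batch_size

    # Continuous-time minimum, then margin for latency and retries
    min_rps = effective_items / deadline_seconds
    return min_rps * margin

# 950 items in 60s quantizes up to 10 batches (1000 effective items)
budget = calculate_min_rps_batched(950, 60)
```

The naive continuous formula would return 950 / 60 × 1.5 ≈ 23.75 rps; accounting for batching pushes the recommendation to 25.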




&lt;h3&gt;
  
  
  Use Case 2: Worker Pool Sizing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; How many workers needed to process N jobs within SLA S?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_workers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_jobs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sla_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_job_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Continuous time minimum
&lt;/span&gt;    &lt;span class="n"&gt;min_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_jobs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;avg_job_duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sla_seconds&lt;/span&gt;

    &lt;span class="c1"&gt;# Account for discrete worker count
&lt;/span&gt;    &lt;span class="n"&gt;min_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_workers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add margin for variance
&lt;/span&gt;    &lt;span class="n"&gt;recommended_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_workers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_workers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Margin accounts for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job duration variance (some take 2x average)&lt;/li&gt;
&lt;li&gt;Worker startup time&lt;/li&gt;
&lt;li&gt;Occasional failures requiring retries&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Use Case 3: Cache Sizing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; How much cache needed to keep N items with TTL T?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_min_cache_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_rate_per_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Items alive at any time
&lt;/span&gt;    &lt;span class="n"&gt;items_in_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_items&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_rate_per_sec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_items&lt;/span&gt;

    &lt;span class="c1"&gt;# Add margin for spikes
&lt;/span&gt;    &lt;span class="n"&gt;recommended_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;items_in_cache&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;margin&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recommended_size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Margin accounts for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic spikes (2x normal)&lt;/li&gt;
&lt;li&gt;Non-uniform access patterns&lt;/li&gt;
&lt;li&gt;Cascading failures if cache full&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Debugging Budget Issues: A Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When tests fail with "sufficient" budget:&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Check Integer Truncation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate what you're actually getting
&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="c1"&gt;# checks/min
&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;actual_budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cycle_interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual_budget&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Truncation loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; requested, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_budget&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Check Cycle Quantization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ensure formula accounts for discrete cycles
# DON'T: continuous_time_formula(participants, deadline)
# DO: discrete_cycle_formula(participants, num_cycles, cycle_interval)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Check Tier Monopolization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Verify tier ordering doesn't cause starvation
# If staleness-first: Must check ALL before first recheck
# If fairness-first: Must have enough budget after fairness for staleness
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Check Margin Sufficiency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Is 1.5x margin enough?
# Try 2.0x - if tests pass, margin was the issue
# If tests still fail, there's a real bug (not just insufficient budget)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Check Formula Matches Policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Does formula assume fairness-first but policy is staleness-first?
# Formula must match actual tier ordering
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Flip-Flop Timeline (Summary)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formula: checks_per_minute = (N / deadline) * 60&lt;/td&gt;
&lt;td&gt;Initial attempt&lt;/td&gt;
&lt;td&gt;❌ Failed (quantization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flip tier order: fairness-first&lt;/td&gt;
&lt;td&gt;Tests failed, assumed policy wrong&lt;/td&gt;
&lt;td&gt;✅ Fairness test passed, ❌ staleness test failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flip back: staleness-first&lt;/td&gt;
&lt;td&gt;User intervention: stop flip-flopping&lt;/td&gt;
&lt;td&gt;❌ Fairness test failed again&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formula: checks before recheck (not deadline)&lt;/td&gt;
&lt;td&gt;Realized monopolization issue&lt;/td&gt;
&lt;td&gt;❌ Still failed (truncation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make formula TIGHTER (wrong direction!)&lt;/td&gt;
&lt;td&gt;Tests failed at "sufficient" budget&lt;/td&gt;
&lt;td&gt;❌ User intervention: wrong direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add 1.5x margin&lt;/td&gt;
&lt;td&gt;Account for waste&lt;/td&gt;
&lt;td&gt;✅ Tests passed!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add rolling accumulator&lt;/td&gt;
&lt;td&gt;Eliminate truncation waste&lt;/td&gt;
&lt;td&gt;✅ Budget utilized fully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build calculator FIRST, use in tests&lt;/td&gt;
&lt;td&gt;Correct workflow&lt;/td&gt;
&lt;td&gt;✅ Stable, correct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total iterations:&lt;/strong&gt; 8&lt;br&gt;
&lt;strong&gt;User interventions:&lt;/strong&gt; 3&lt;br&gt;
&lt;strong&gt;Time wasted:&lt;/strong&gt; ~2 hours&lt;br&gt;
&lt;strong&gt;Time with correct approach:&lt;/strong&gt; ~30 minutes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Build calculator first, stop flip-flopping, add margin.&lt;/p&gt;
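Iteration 7's rolling accumulator deserves a sketch (the function name is illustrative): instead of truncating the fractional per-cycle allowance and losing it, carry the remainder into the next cycle so the truncation waste disappears.

```python
def checks_for_cycle(checks_per_minute, cycle_interval_seconds, carry):
    """Return (whole checks to run this cycle, fractional carry for the next).

    Carrying the remainder eliminates the loss that int() alone causes,
    e.g. 30 checks/min at 5s cycles is 2.5 per cycle, not int(2.5) = 2.
    """
    allowance = checks_per_minute / 60 * cycle_interval_seconds + carry
    whole = int(allowance)
    return whole, allowance - whole

# 30 checks/min at 5s cycles: alternating 2 and 3 checks averages 2.5/cycle
carry = 0.0
total = 0
for _ in range(12):  # one minute of 5s cycles
    n, carry = checks_for_cycle(30, 5, carry)
    total += n
# total is the full 30-check budget, with no quantization loss
```

With plain truncation the same minute would deliver only 24 checks (2 per cycle × 12 cycles), which is exactly the gap the 1.5x margin was papering over.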


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What we learned:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Build budget calculator FIRST, then use it to validate tests&lt;/li&gt;
&lt;li&gt;✅ Don't react to test failures by changing policy&lt;/li&gt;
&lt;li&gt;✅ Account for reality: quantization, truncation, margin&lt;/li&gt;
&lt;li&gt;✅ Separate concerns: Policy vs Budget vs Tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The anti-pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write test → Guess budget → Test fails → Change policy → Test fails → Change formula → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The correct pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Define policy → Build calculator → Add margin → Use in tests → Debug bugs (not policy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Would we make this mistake again?&lt;/strong&gt; Probably not! The lesson was learned through painful iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next time:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define policy requirements clearly upfront&lt;/li&gt;
&lt;li&gt;Build budget calculator based on policy&lt;/li&gt;
&lt;li&gt;Add realistic margin (1.5x - 2.0x)&lt;/li&gt;
&lt;li&gt;Use calculator in ALL tests&lt;/li&gt;
&lt;li&gt;If tests fail with calculator-provided budget, it's a BUG (not wrong budget)&lt;/li&gt;
&lt;/ol&gt;
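Put together, the five steps above reduce to a pattern like this (a sketch reusing the article's final calculator in compact form; the test name is illustrative):

```python
def calculate_min_budget_for_fairness(num_participants,
                                      recheck_interval_seconds=60,
                                      cycle_interval_seconds=5,
                                      margin=1.5):
    # Steps 1-3: calculator derived from the policy, with realistic margin
    num_cycles = recheck_interval_seconds / cycle_interval_seconds
    checks_per_cycle = num_participants / num_cycles
    return checks_per_cycle / cycle_interval_seconds * 60 * margin

def test_fairness_policy():
    # Step 4: the test asks the calculator instead of guessing a number
    budget = calculate_min_budget_for_fairness(num_participants=10)
    assert abs(budget - 15.0) < 1e-9
    # Step 5: a failure here with a calculator-provided budget means a bug
    # in the implementation, not a budget that needs bumping

test_fairness_policy()
```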




&lt;h2&gt;
  
  
  Series Conclusion
&lt;/h2&gt;

&lt;p&gt;We've covered 6 aspects of multi-agent AI development:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Part 1:&lt;/strong&gt; Can 5 Claude Code Agents Work Independently? - The optimistic hypothesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2:&lt;/strong&gt; The Reality of Autonomous Development - Human orchestration required (31% autonomy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3:&lt;/strong&gt; Property-Based Testing with Hypothesis - The data you're throwing away&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4:&lt;/strong&gt; Zero-Conflict Architecture - File-level ownership (100% auto-merge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5:&lt;/strong&gt; Communication Protocols for AI Agents - 4 iterations to file-based messaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6:&lt;/strong&gt; The Budget Calculator Paradox - Build it first, use it everywhere&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Overall lessons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-conflict architecture works (100% auto-merge)&lt;/li&gt;
&lt;li&gt;Human-AI collaboration &amp;gt; pure autonomy (orchestration essential)&lt;/li&gt;
&lt;li&gt;Verification before coding (model introspection prevents wasted effort)&lt;/li&gt;
&lt;li&gt;Knowledge preservation (capture Hypothesis shrunken cases)&lt;/li&gt;
&lt;li&gt;Build calculators first (don't guess in tests)&lt;/li&gt;
&lt;li&gt;Explicit communication (templates, commands, not assumptions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Was it worth it?&lt;/strong&gt; Absolutely. 75% time savings despite 12.5% orchestration overhead. Would do it again with lessons learned.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #budget-calculator #capacity-planning #testing #formulas #quantization #margin #lessons-learned&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 6 (Final) of the Multi-Agent Development Series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discussion:&lt;/strong&gt; Have you struggled with capacity formulas? Do you build calculators first or adjust tests reactively? What's your approach to margin and safety factors? Share in the comments!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>agents</category>
    </item>
    <item>
      <title>Communication Protocols for AI Agents That Can't Talk to Each Other</title>
      <dc:creator>Aviad Rozenhek</dc:creator>
      <pubDate>Thu, 06 Nov 2025 20:57:49 +0000</pubDate>
      <link>https://dev.to/aviadr1/communication-protocols-for-ai-agents-that-cant-talk-to-each-other-b23</link>
      <guid>https://dev.to/aviadr1/communication-protocols-for-ai-agents-that-cant-talk-to-each-other-b23</guid>
      <description>&lt;p&gt;&lt;em&gt;4 iterations to get file-based messaging working when you can't use Slack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 5 of the Multi-Agent Development Series&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Can 5 Claude Code Agents Work Independently?&lt;/li&gt;
&lt;li&gt;Part 2: The Reality of "Autonomous" Multi-Agent Development&lt;/li&gt;
&lt;li&gt;Part 3: Property-Based Testing with Hypothesis&lt;/li&gt;
&lt;li&gt;Part 4: Zero-Conflict Architecture&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; 5 AI agents in isolated context windows need to coordinate work. No shared state. No real-time chat. Different tool environments (Web vs CLI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we tried:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GitHub PR comments (agents in Web can't read them)&lt;/li&gt;
&lt;li&gt;File-based messages (agents didn't understand them)&lt;/li&gt;
&lt;li&gt;Clear action items (still too vague)&lt;/li&gt;
&lt;li&gt;FEEDBACK-PR-X.md + explicit instructions + GitHub redundancy (finally worked!)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Communication protocol design is hard. What seems obvious to humans ("check your PR comments") isn't obvious to agents in different environments. A successful protocol needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit instructions (not assumptions)&lt;/li&gt;
&lt;li&gt;Multiple channels (redundancy)&lt;/li&gt;
&lt;li&gt;Clear format (markdown structure)&lt;/li&gt;
&lt;li&gt;Async design (file-based, not real-time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final solution:&lt;/strong&gt; Git as the communication bus, markdown files as the message format.&lt;/p&gt;
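&lt;p&gt;The read side of this message bus can be wrapped in a few lines of Python, assuming plain &lt;code&gt;git&lt;/code&gt; is available (the helper is an illustrative sketch, not part of the original tooling):&lt;/p&gt;

```python
import subprocess

def read_message(repo_dir: str, ref: str, filename: str) -> str:
    """Read a message file from another branch without checking it out,
    mirroring the protocol's `git show origin/BRANCH:FILE` step."""
    result = subprocess.run(
        ["git", "show", f"{ref}:{filename}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

&lt;p&gt;Because it is pure &lt;code&gt;git&lt;/code&gt;, the same read works in Web and CLI environments, and the message stays in history for debugging.&lt;/p&gt;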




&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;5 work stream agents (PR-1 through PR-5)&lt;/li&gt;
&lt;li&gt;1 integration agent (PR-6)&lt;/li&gt;
&lt;li&gt;Each in isolated Claude Code session (separate context windows)&lt;/li&gt;
&lt;li&gt;No shared state between sessions&lt;/li&gt;
&lt;li&gt;Need to communicate: Status updates, test results, bug reports, action items&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technical constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolated contexts:&lt;/strong&gt; Each agent can't see other agents' conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different environments:&lt;/strong&gt; Some agents in Web (no &lt;code&gt;gh&lt;/code&gt; CLI), some in CLI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No real-time:&lt;/strong&gt; Agents don't run continuously, can't push notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared memory:&lt;/strong&gt; Can't use global variables or shared state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Operational constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Async-first:&lt;/strong&gt; Agents work at different speeds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop:&lt;/strong&gt; Human orchestrates, can't relay every message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Must scale:&lt;/strong&gt; 5 agents × 2 channels (send/receive) = 10 communication paths&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What we needed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration agent tells work stream agents: "Your PR merged, verify tests pass"&lt;/li&gt;
&lt;li&gt;Work stream agents tell integration: "Tests verified, all good" or "Found issues, need help"&lt;/li&gt;
&lt;li&gt;Persistent communication trail (for debugging)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Iteration 1: GitHub PR Comments (FAILED)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Plan
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Seemed obvious:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Integration posts on PR-2:&lt;/span&gt;
@claude-agent-pr-2

Your PR has been merged into integration branch!

Action items:
&lt;span class="p"&gt;1.&lt;/span&gt; Verify all tests still pass
&lt;span class="p"&gt;2.&lt;/span&gt; Check for integration issues
&lt;span class="p"&gt;3.&lt;/span&gt; Report back via PR comment

Thanks!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Integration agent posts comment&lt;/li&gt;
&lt;li&gt;PR-2 agent checks PR, reads comment&lt;/li&gt;
&lt;li&gt;PR-2 agent responds with results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Seemed foolproof, right?&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What Actually Happened
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integration agent:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gh &lt;span class="nb"&gt;pr &lt;/span&gt;comment 123 &lt;span class="nt"&gt;--body&lt;/span&gt; &lt;span class="s2"&gt;"Your PR merged, please verify..."&lt;/span&gt;
Comment created successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PR-2 agent (in Claude Code Web):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Check your PR for integration status

Agent: "Let me check the PR comments..."
Agent: "I'll use gh CLI to read comments"
System: Error - 'gh' command not found
Agent: "I can't access GitHub CLI in this environment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Claude Code Web doesn't have &lt;code&gt;gh&lt;/code&gt; CLI access. Can't read PR comments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why It Failed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Assumptions we made:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ GitHub is accessible (true)&lt;/li&gt;
&lt;li&gt;✅ PR comments are visible (true via UI)&lt;/li&gt;
&lt;li&gt;❌ Agents can READ PR comments programmatically (FALSE in Web environment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The gap:&lt;/strong&gt; Web agents have no tool to fetch PR comments. They can't even &lt;code&gt;curl&lt;/code&gt; the GitHub API without auth tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; One-way communication. Integration can WRITE comments, agents can't READ them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 2: File-Based Messages (PARTIAL SUCCESS)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Plan
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Git as message bus:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Integration branch (claude/integrate-...):
  FEEDBACK-PR-2.md  # Message for PR-2 agent
  FEEDBACK-PR-3.md  # Message for PR-3 agent
  ...

Work stream branch (claude/budget-allocation-...):
  INTEGRATION-RESPONSE.md  # Response to integration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Integration writes message:&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Your PR merged..."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; FEEDBACK-PR-2.md
&lt;span class="nv"&gt;$ &lt;/span&gt;git add FEEDBACK-PR-2.md
&lt;span class="nv"&gt;$ &lt;/span&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"feedback: PR-2 integration complete"&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git push

&lt;span class="c"&gt;# PR-2 agent reads message:&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git fetch origin claude/integrate-...
&lt;span class="nv"&gt;$ &lt;/span&gt;git show origin/claude/integrate-...:FEEDBACK-PR-2.md
&lt;span class="c"&gt;# Reads the message!&lt;/span&gt;

&lt;span class="c"&gt;# PR-2 agent responds:&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Tests verified..."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; INTEGRATION-RESPONSE.md
&lt;span class="nv"&gt;$ &lt;/span&gt;git add INTEGRATION-RESPONSE.md
&lt;span class="nv"&gt;$ &lt;/span&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"response: Tests passing"&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git push

&lt;span class="c"&gt;# Integration reads response:&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git fetch origin claude/budget-allocation-...
&lt;span class="nv"&gt;$ &lt;/span&gt;git show origin/claude/budget-allocation-...:INTEGRATION-RESPONSE.md
&lt;span class="c"&gt;# Reads the response!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pure git operations (works in all environments)&lt;/li&gt;
&lt;li&gt;✅ Persistent (messages in git history)&lt;/li&gt;
&lt;li&gt;✅ Async (write anytime, read anytime)&lt;/li&gt;
&lt;li&gt;✅ No special tools needed&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What Actually Happened
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integration wrote:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# FEEDBACK-PR-2.md&lt;/span&gt;

Your PR has been merged.

Please verify tests and report back.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PR-2 agent:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Check for messages from integration

Agent: "Let me fetch the integration branch"
$ git fetch origin claude/integrate-...

Agent: "Let me read the feedback file"
$ git show origin/claude/integrate-...:FEEDBACK-PR-2.md

# FEEDBACK-PR-2.md
Your PR has been merged.
Please verify tests and report back.

Agent: "I see a message. It says my PR merged and to verify tests."
Agent: "But what exactly should I do? Run which tests? What format for response?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Vague instructions. Agent understood there was a message but didn't know the concrete actions.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why It Partially Worked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Message delivery (agent can read the file)&lt;/li&gt;
&lt;li&gt;✅ Persistence (message in git)&lt;/li&gt;
&lt;li&gt;✅ Cross-environment (works in Web and CLI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What didn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Vague instructions ("verify tests" - which tests? how?)&lt;/li&gt;
&lt;li&gt;❌ No clear response format (how to report back?)&lt;/li&gt;
&lt;li&gt;❌ No deadline/urgency indication&lt;/li&gt;
&lt;li&gt;❌ No context (what was merged? what changed?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent was willing but unclear on next steps.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 3: Clear Action Items (BETTER)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Improvement
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Added explicit instructions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# FEEDBACK-PR-2.md&lt;/span&gt;

&lt;span class="gu"&gt;## Summary&lt;/span&gt;
✅ Your PR #123 (Budget Allocation Tests) was successfully merged into integration branch &lt;span class="sb"&gt;`claude/integrate-test-improvements-011CUrkV7Q9B7UpLgPpwL1K8`&lt;/span&gt; at 2025-11-06 14:30.

&lt;span class="gu"&gt;## What Changed&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Added 14 budget allocation unit tests
&lt;span class="p"&gt;-&lt;/span&gt; Modified &lt;span class="sb"&gt;`video_moderator.py`&lt;/span&gt; with rolling budget accumulator
&lt;span class="p"&gt;-&lt;/span&gt; All tests passed in isolation on your branch

&lt;span class="gu"&gt;## Required Action&lt;/span&gt;

&lt;span class="gs"&gt;**Please verify that your tests still pass after integration:**&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Fetch the integration branch:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   git fetch origin claude/integrate-test-improvements-011CUrkV7Q9B7UpLgPpwL1K8

2. Run your specific tests:

   uv run pytest tests/test_video_moderation/unit/test_budget_allocation.py -v

3. Verify all 14 tests pass

4. Create response file on YOUR branch:

   echo "See response below" &amp;gt; INTEGRATION-RESPONSE.md
   git add INTEGRATION-RESPONSE.md
   git commit -m "@integration - verification complete"
   git push

5. Use this response format:

   # INTEGRATION-RESPONSE.md

   **Agent**: PR-2
   **Status**: [ALL PASSING / ISSUES FOUND]
   **Date**: YYYY-MM-DD HH:MM

   ## Test Results
   [Paste pytest output here]

   ## Issues (if any)
   [Describe any problems]

## Timeline
Please complete within 4 hours if possible (not blocking other PRs).

## Questions?
If you encounter issues, create a file called INTEGRATION-QUESTIONS.md on your branch.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  What Actually Happened
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PR-2 agent:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Check for integration feedback

$ git fetch origin claude/integrate-...
$ git show origin/claude/integrate-...:FEEDBACK-PR-2.md

Agent: "I see detailed instructions!"
Agent: "Step 1: Fetch integration branch"
$ git fetch origin claude/integrate-test-improvements-011CUrkV7Q9B7UpLgPpwL1K8
Done.

Agent: "Step 2: Run tests"
$ uv run pytest tests/test_video_moderation/unit/test_budget_allocation.py -v

... tests run ...
14/14 passing ✅

Agent: "Step 3: Create response"
$ echo "Results below..." &amp;gt; INTEGRATION-RESPONSE.md
$ git add INTEGRATION-RESPONSE.md
$ git commit -m "@integration - all tests passing"
$ git push

# INTEGRATION-RESPONSE.md created with:
**Status**: ALL PASSING
**Date**: 2025-11-06 15:00

## Test Results
test_tier1_critical_urgency: PASSED
test_tier2_never_moderated: PASSED
... (14/14 tests)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SUCCESS!&lt;/strong&gt; Agent followed instructions completely.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why It Worked Better
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Explicit commands (copy-paste bash commands)&lt;/li&gt;
&lt;li&gt;✅ Clear success criteria (all 14 tests pass)&lt;/li&gt;
&lt;li&gt;✅ Response template (exact format specified)&lt;/li&gt;
&lt;li&gt;✅ Timeline (4 hours)&lt;/li&gt;
&lt;li&gt;✅ Escape hatch (INTEGRATION-QUESTIONS.md if stuck)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent had everything needed:&lt;/strong&gt; What to do, how to do it, what to report, when to do it by.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 4: Dual-Channel Redundancy (FINAL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Refinement
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; What if agent doesn't check the file? No notification mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Dual-channel communication&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel 1: File-based (detailed instructions)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FEEDBACK-PR-X.md&lt;/code&gt; on integration branch&lt;/li&gt;
&lt;li&gt;Full context, commands, expected results&lt;/li&gt;
&lt;li&gt;Permanent record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Channel 2: GitHub PR comment (notification + summary)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Posted on the actual PR&lt;/li&gt;
&lt;li&gt;Brief summary + pointer to detailed file&lt;/li&gt;
&lt;li&gt;Notification mechanism (shows up in GitHub UI)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integration agent workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create detailed feedback file&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; FEEDBACK-PR-2.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
# Detailed instructions as shown in Iteration 3
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;git add FEEDBACK-PR-2.md
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"feedback: PR-2 integration complete"&lt;/span&gt;
git push

&lt;span class="c"&gt;# 2. Post GitHub notification (for human visibility)&lt;/span&gt;
gh &lt;span class="nb"&gt;pr &lt;/span&gt;comment 123 &lt;span class="nt"&gt;--body&lt;/span&gt; &lt;span class="s2"&gt;"@claude-agent

PR-2 has been integrated!

📋 **Detailed instructions**: See FEEDBACK-PR-2.md on integration branch

**Quick summary**:
- Your PR merged successfully ✅
- Please verify tests still pass
- Respond via INTEGRATION-RESPONSE.md on your branch

**To read detailed instructions**:
&lt;/span&gt;&lt;span class="se"&gt;\`\`\`&lt;/span&gt;&lt;span class="s2"&gt;bash
git fetch origin claude/integrate-test-improvements-011CUrkV7Q9B7UpLgPpwL1K8
git show origin/claude/integrate-...:FEEDBACK-PR-2.md
&lt;/span&gt;&lt;span class="se"&gt;\`\`\`&lt;/span&gt;&lt;span class="s2"&gt;

Timeline: 4 hours (not blocking)
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Why Dual-Channel?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Redundancy benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Humans can see progress&lt;/strong&gt; (GitHub PR comments visible in UI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents have detailed instructions&lt;/strong&gt; (FEEDBACK file)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification layer&lt;/strong&gt; (PR comment draws attention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging trail&lt;/strong&gt; (both channels logged)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real-world benefit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human could monitor progress via GitHub web UI&lt;/li&gt;
&lt;li&gt;Agents had clear instructions via git files&lt;/li&gt;
&lt;li&gt;If agent missed file, human could prompt: "Check FEEDBACK-PR-2.md"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Final Protocol
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Message Types
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. FEEDBACK-PR-X.md (Integration → Work Stream)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Tell work stream agent about integration status, request actions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# FEEDBACK-PR-X.md&lt;/span&gt;

&lt;span class="gu"&gt;## Summary&lt;/span&gt;
[One-line status: merged successfully / issues found / waiting]

&lt;span class="gu"&gt;## What Changed&lt;/span&gt;
[What was merged, what's new in integration branch]

&lt;span class="gu"&gt;## Required Actions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; [Specific action with bash command]
&lt;span class="p"&gt;2.&lt;/span&gt; [Another action with bash command]
...

&lt;span class="gu"&gt;## Expected Results&lt;/span&gt;
[What "success" looks like]

&lt;span class="gu"&gt;## Response Format&lt;/span&gt;
[Template for INTEGRATION-RESPONSE.md]

&lt;span class="gu"&gt;## Timeline&lt;/span&gt;
[Deadline if any]

&lt;span class="gu"&gt;## Questions?&lt;/span&gt;
[How to ask for help]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
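&lt;p&gt;Filling this template can be automated so every message stays consistent. A sketch of such a helper (illustrative, not the integration agent's actual code):&lt;/p&gt;

```python
# Illustrative helper: fill the FEEDBACK-PR-X.md template with
# explicit, numbered action items (field names mirror the template).
FEEDBACK_TEMPLATE = """# FEEDBACK-PR-{pr}

## Summary
{summary}

## Required Actions
{actions}

## Timeline
{timeline}
"""

def render_feedback(pr: int, summary: str, actions: list, timeline: str) -> str:
    # Number the actions so the receiving agent gets ordered, concrete steps.
    numbered = "\n".join(f"{i}. {a}" for i, a in enumerate(actions, 1))
    return FEEDBACK_TEMPLATE.format(
        pr=pr, summary=summary, actions=numbered, timeline=timeline
    )
```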






&lt;h4&gt;
  
  
  2. INTEGRATION-RESPONSE.md (Work Stream → Integration)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Report back on verification status&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# INTEGRATION-RESPONSE.md&lt;/span&gt;

&lt;span class="gs"&gt;**Agent**&lt;/span&gt;: PR-X
&lt;span class="gs"&gt;**Status**&lt;/span&gt;: ALL PASSING | ISSUES FOUND | NEED HELP
&lt;span class="gs"&gt;**Date**&lt;/span&gt;: YYYY-MM-DD HH:MM

&lt;span class="gu"&gt;## Test Results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pytest ...
[Full output]

## Issues (if any)
[Describe problems encountered]

## Questions (if any)
[Ask integration agent for clarification]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  3. STATUS-PR-X.md (Work Stream → Integration, Optional)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Progress updates during long-running work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# STATUS-PR-X.md&lt;/span&gt;

&lt;span class="gs"&gt;**Last Updated**&lt;/span&gt;: YYYY-MM-DD HH:MM
&lt;span class="gs"&gt;**Current Activity**&lt;/span&gt;: [What agent is doing now]
&lt;span class="gs"&gt;**Progress**&lt;/span&gt;: X / Y tasks complete

&lt;span class="gu"&gt;## Completed&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [x] Task 1
&lt;span class="p"&gt;-&lt;/span&gt; [x] Task 2

&lt;span class="gu"&gt;## In Progress&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Task 3 (current)

&lt;span class="gu"&gt;## Blocked&lt;/span&gt;
[Any blockers encountered]

&lt;span class="gu"&gt;## ETA&lt;/span&gt;
[Estimated completion time]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Communication Workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐
│  Integration    │
│     Agent       │
└────────┬────────┘
         │
         │ 1. Merge PR-2
         │ 2. Create FEEDBACK-PR-2.md
         │ 3. Post GitHub comment
         │
         ↓
    ╔════════════════════════╗
    ║  Integration Branch    ║
    ║  FEEDBACK-PR-2.md      ║
    ╚════════════════════════╝
         │
         │ 4. PR-2 agent fetches
         │
         ↓
┌─────────────────┐
│    PR-2 Agent   │
│  (Work Stream)  │
└────────┬────────┘
         │
         │ 5. Reads FEEDBACK-PR-2.md
         │ 6. Executes actions
         │ 7. Creates INTEGRATION-RESPONSE.md
         │ 8. Pushes to PR-2 branch
         │
         ↓
    ╔════════════════════════╗
    ║  PR-2 Branch           ║
    ║  INTEGRATION-RESPONSE  ║
    ╚════════════════════════╝
         │
         │ 9. Integration fetches
         │
         ↓
┌─────────────────┐
│  Integration    │
│     Agent       │
└─────────────────┘
    Reads response,
    takes next action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What We Learned About Agent Communication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Explicitness Over Cleverness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;❌ Don't:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Please verify your changes integrated correctly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Run this exact command:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run pytest tests/test_video_moderation/unit/test_budget_allocation.py -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expected output: All 14 tests should PASS.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Agents are literal. "Verify" is vague. "Run this command" is clear.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Templates Over Freeform
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;❌ Don't:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Report back with your results.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create INTEGRATION-RESPONSE.md with this exact format:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**Status**&lt;/span&gt;: [ALL PASSING / ISSUES FOUND]
&lt;span class="gs"&gt;**Date**&lt;/span&gt;: YYYY-MM-DD

&lt;span class="gu"&gt;## Test Results&lt;/span&gt;
[Paste output here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Templates reduce ambiguity. Agent knows exactly what format to use.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Commands Over Descriptions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;❌ Don't:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Check the integration branch for changes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;$ git fetch origin claude/integrate-test-improvements-011CUrkV7Q9B7UpLgPpwL1K8
$ git log origin/claude/integrate-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Copy-paste commands are foolproof. No interpretation needed.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Async Over Real-Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;❌ Don't:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Respond immediately via Slack/chat.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create response file within 4 hours (not blocking other work).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Agents don't run continuously. Async file-based messaging works across time zones (metaphorically speaking).&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Redundancy Over Single Channel
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;❌ Don't:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Only post GitHub comment OR only create file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create detailed FEEDBACK file + post GitHub summary comment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File: Detailed, persistent, git-tracked&lt;/li&gt;
&lt;li&gt;Comment: Notification, human-visible&lt;/li&gt;
&lt;li&gt;Both: Redundancy if one fails&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Edge Cases We Hit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Edge Case 1: Agent Didn't Check Messages
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; PR-3 agent never fetched integration branch, didn't see FEEDBACK file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Human intervention&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; PR-3 agent, please check for feedback from integration
&amp;gt; Check integration branch for feedback

Agent: [fetches and reads FEEDBACK-PR-3.md]
Agent: "I see the feedback now! Working on it..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; No automatic polling mechanism. Humans must prompt agents to check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ideal solution (not implemented):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Hypothetical: Agent runs on schedule
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;check_for_feedback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;feedback_found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;process_and_respond&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Check hourly
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt; Human orchestrates the "check now" trigger.&lt;/p&gt;




&lt;h3&gt;
  
  
  Edge Case 2: Agent Misunderstood Template
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; The PR-4 agent created a response but used the wrong format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**Status**&lt;/span&gt;: ALL PASSING
&lt;span class="gs"&gt;**Date**&lt;/span&gt;: 2025-11-06
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Status: All tests passing ✅
Date: November 6th, 2025
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; The integration agent parsed it anyway (its parser was lenient about formatting).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Even with templates, agents format their responses slightly differently. Build parsers that tolerate variation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Edge Case 3: Circular Waiting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration waiting for PR-4 response&lt;/li&gt;
&lt;li&gt;PR-4 waiting for PR-2 to finish (it assumed a dependency that didn't exist)&lt;/li&gt;
&lt;li&gt;Neither progressing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; A human detected the deadlock and clarified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; PR-4, you don't need to wait for PR-2. Please proceed independently.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Make dependencies explicit in FEEDBACK files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Dependencies&lt;/span&gt;
This task has NO dependencies. Proceed immediately.

OR

&lt;span class="gu"&gt;## Dependencies&lt;/span&gt;
Wait for PR-2 to complete before starting. You'll receive another FEEDBACK when ready.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
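
&lt;p&gt;A small sketch of how an agent might act on an explicit Dependencies section. The section text comes from the templates above; the detection heuristic is illustrative.&lt;/p&gt;

```python
# Sketch: read the "## Dependencies" section and decide whether to start work.
def blocking_dependency(feedback: str) -> bool:
    """Return True if the Dependencies section says to wait."""
    in_section = False
    for line in feedback.splitlines():
        if line.strip().lower() == "## dependencies":
            in_section = True
            continue
        if in_section:
            if line.startswith("## "):       # next section: stop scanning
                break
            if "wait for" in line.lower():
                return True
            if "no dependencies" in line.lower():
                return False
    return False   # default: proceed if nothing says otherwise

free = "## Dependencies\nThis task has NO dependencies. Proceed immediately.\n"
blocked = "## Dependencies\nWait for PR-2 to complete before starting.\n"
```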






&lt;h3&gt;
  
  
  Edge Case 4: Message File Overwritten
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; The integration agent pushed FEEDBACK-PR-2.md twice, overwriting the first set of instructions with an update. PR-2 only saw the second version and missed the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Git history preserves both&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git log &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; FEEDBACK-PR-2.md
&lt;span class="c"&gt;# Shows both versions&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;git show commit1:FEEDBACK-PR-2.md  &lt;span class="c"&gt;# First version&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git show commit2:FEEDBACK-PR-2.md  &lt;span class="c"&gt;# Second version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Git history is valuable, but agents read the working tree, not the log. Don't delete or overwrite messages; append or version them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better approach:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FEEDBACK-PR-2-v1.md  (initial message)
FEEDBACK-PR-2-v2.md  (update)
FEEDBACK-PR-2.md     (symlink to latest)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
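
&lt;p&gt;The versioned-file layout above can be sketched in Python. A plain copy stands in for the symlink so the sketch stays portable; the file names follow the layout shown.&lt;/p&gt;

```python
# Sketch: write FEEDBACK-PR-X-vN.md per message and refresh a "latest" copy
# (a copy instead of a symlink, for portability).
import tempfile
from pathlib import Path

def send_feedback(directory: Path, pr_id: int, body: str) -> Path:
    """Write the next version of the feedback file and refresh 'latest'."""
    stem = f"FEEDBACK-PR-{pr_id}"
    existing = sorted(directory.glob(f"{stem}-v*.md"))
    version = len(existing) + 1
    versioned = directory / f"{stem}-v{version}.md"
    versioned.write_text(body)
    (directory / f"{stem}.md").write_text(body)  # "latest" pointer
    return versioned

workdir = Path(tempfile.mkdtemp())
first = send_feedback(workdir, 2, "initial message")
second = send_feedback(workdir, 2, "updated instructions")
```

&lt;p&gt;Both versions stay visible in the working tree, so an agent that fetches late still sees the full sequence without digging through git history.&lt;/p&gt;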






&lt;h2&gt;
  
  
  Alternative Protocols We Considered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Alternative 1: Shared Database
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; All agents read/write to shared Postgres/Redis&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time updates&lt;/li&gt;
&lt;li&gt;Queryable state&lt;/li&gt;
&lt;li&gt;Structured data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires external service&lt;/li&gt;
&lt;li&gt;Authentication complexity&lt;/li&gt;
&lt;li&gt;Not git-tracked (no history)&lt;/li&gt;
&lt;li&gt;Claude Code doesn't have DB clients built-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Too complex for our use case.&lt;/p&gt;




&lt;h3&gt;
  
  
  Alternative 2: GitHub Issues as Messages
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Create GitHub issue per agent, use comments for communication&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native GitHub UI&lt;/li&gt;
&lt;li&gt;Notifications built-in&lt;/li&gt;
&lt;li&gt;Searchable/linkable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web agents can't read issues (same &lt;code&gt;gh&lt;/code&gt; CLI problem)&lt;/li&gt;
&lt;li&gt;Clutters issue tracker&lt;/li&gt;
&lt;li&gt;Not suitable for rapid back-and-forth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Same problem as PR comments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Alternative 3: Shared Google Doc
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; All agents edit shared Google Doc with sections per PR&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time collaboration&lt;/li&gt;
&lt;li&gt;Human-readable&lt;/li&gt;
&lt;li&gt;Version history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Google API auth&lt;/li&gt;
&lt;li&gt;Claude Code can't edit Google Docs&lt;/li&gt;
&lt;li&gt;Not in git (separate system)&lt;/li&gt;
&lt;li&gt;Race conditions if concurrent edits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Doesn't work with Claude Code constraints.&lt;/p&gt;




&lt;h3&gt;
  
  
  Alternative 4: Kafka/Message Queue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Agents publish/subscribe to Kafka topics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designed for async messaging&lt;/li&gt;
&lt;li&gt;Durable, scalable&lt;/li&gt;
&lt;li&gt;Structured events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive overkill for 6 agents&lt;/li&gt;
&lt;li&gt;Requires Kafka cluster&lt;/li&gt;
&lt;li&gt;Claude Code doesn't have Kafka client&lt;/li&gt;
&lt;li&gt;No persistent file-based history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Way too complex.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Git-Based Messaging Won
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Git as communication bus wins because:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Already there:&lt;/strong&gt; Every PR has a git branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal:&lt;/strong&gt; Works in Web and CLI environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent:&lt;/strong&gt; Complete history in git log&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async-native:&lt;/strong&gt; Fetch/push anytime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No external dependencies:&lt;/strong&gt; Just git&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debuggable:&lt;/strong&gt; Can inspect messages anytime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable:&lt;/strong&gt; Markdown files anyone can read&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The downside:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No real-time notifications (have to poll)&lt;/li&gt;
&lt;li&gt;Requires explicit fetch commands&lt;/li&gt;
&lt;li&gt;File-based (not structured data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But the upsides far outweighed the downsides.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Recommendations for Multi-Agent Communication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Design Principles
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Async-first&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents work at different speeds&lt;/li&gt;
&lt;li&gt;Messages must work without real-time synchronization&lt;/li&gt;
&lt;li&gt;File-based &amp;gt; real-time chat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Explicit over clever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide exact bash commands&lt;/li&gt;
&lt;li&gt;Use templates for responses&lt;/li&gt;
&lt;li&gt;Don't assume agents will "figure it out"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Redundant channels&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary: Detailed file (FEEDBACK-PR-X.md)&lt;/li&gt;
&lt;li&gt;Secondary: Notification (GitHub comment, Slack, email)&lt;/li&gt;
&lt;li&gt;Humans monitor both channels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Self-contained messages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each message includes full context&lt;/li&gt;
&lt;li&gt;Don't reference previous messages (agent may not have seen them)&lt;/li&gt;
&lt;li&gt;Include commands, expected results, templates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Git as source of truth&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All communication in git-tracked files&lt;/li&gt;
&lt;li&gt;Permanent history&lt;/li&gt;
&lt;li&gt;Inspectable by humans anytime&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✅ Message Design Checklist
&lt;/h3&gt;

&lt;p&gt;Before sending a message, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Clear action items&lt;/strong&gt; (numbered steps)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Exact bash commands&lt;/strong&gt; (copy-paste ready)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Expected results&lt;/strong&gt; (what success looks like)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Response template&lt;/strong&gt; (format specified)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Timeline&lt;/strong&gt; (deadline or "not blocking")&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Escape hatch&lt;/strong&gt; (how to ask for help)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Context&lt;/strong&gt; (what changed, why agent should care)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example passing checklist:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Action Items&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Fetch integration branch:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
   git fetch origin claude/integrate-...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
2. Run tests:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
   uv run pytest tests/... -v&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
3. Create response:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
   cat &amp;gt; INTEGRATION-RESPONSE.md &amp;lt;&amp;lt;EOF&lt;br&gt;
   &lt;strong&gt;Status&lt;/strong&gt;: ALL PASSING&lt;br&gt;
   &lt;strong&gt;Date&lt;/strong&gt;: $(date +%Y-%m-%d)&lt;br&gt;
   EOF&lt;br&gt;
   git add INTEGRATION-RESPONSE.md &amp;amp;&amp;amp; git commit -m "response" &amp;amp;&amp;amp; git push&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Expected Results
All 14 tests should PASS. If any fail, report in response.

## Response Template
[Template here]

## Timeline
Within 4 hours (not blocking other PRs)

## Need Help?
Create INTEGRATION-QUESTIONS.md on your branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ❌ Common Pitfalls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Vague instructions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ "Please check if everything works"
✅ "Run: uv run pytest tests/test_budget_allocation.py -v
    Expected: All 14 tests PASS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Assuming tool availability&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ "Use gh CLI to check PR status"
✅ "Use git to fetch the branch:
    git fetch origin &lt;span class="nt"&gt;&amp;lt;branch&amp;gt;&lt;/span&gt;
    git show origin/&lt;span class="nt"&gt;&amp;lt;branch&amp;gt;&lt;/span&gt;:FILE.md"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. No response format&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ "Let me know the results"
✅ "Create INTEGRATION-RESPONSE.md with:
    &lt;span class="gs"&gt;**Status**&lt;/span&gt;: [ALL PASSING / ISSUES FOUND]
    &lt;span class="gs"&gt;**Test Output**&lt;/span&gt;: [paste here]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Missing context&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ "Your PR merged, please verify"
✅ "PR #123 (Budget Tests) merged at 14:30.
    Changes: Added 14 tests to test_budget_allocation.py
    Please verify tests still pass after integration"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Unclear timeline&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ "Please respond ASAP"
✅ "Please respond within 4 hours (not blocking other work)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scaling Communication
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Our experiment: 1 integration agent + 5 work stream agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication paths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration → PR-1, PR-2, PR-3, PR-4, PR-5 (5 outgoing)&lt;/li&gt;
&lt;li&gt;PR-1, PR-2, PR-3, PR-4, PR-5 → Integration (5 incoming)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 10 message channels&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manageable!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if we scale to 10 work streams?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration → 10 agents (10 outgoing)&lt;/li&gt;
&lt;li&gt;10 agents → Integration (10 incoming)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 20 message channels&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Still manageable with file-based approach:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FEEDBACK-PR-1.md
FEEDBACK-PR-2.md
...
FEEDBACK-PR-10.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What if agents need to talk to EACH OTHER?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 agents × 9 other agents = 90 communication paths&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not manageable without hierarchy&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution: Hub-and-spoke&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Integration (hub)
      /    |    \
    PR-1  PR-2  PR-3 ... (spokes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agents only talk to integration, not to each other.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what we did, and it worked.&lt;/p&gt;
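
&lt;p&gt;The channel arithmetic above reduces to two small formulas (the function names are mine): a full mesh of n agents needs n*(n-1) directed paths, while hub-and-spoke needs only 2*n, two per spoke.&lt;/p&gt;

```python
# Worked calculation for the channel counts quoted above.
def full_mesh_channels(n_agents: int) -> int:
    """Every agent talks to every other agent, in both directions."""
    return n_agents * (n_agents - 1)

def hub_and_spoke_channels(n_spokes: int) -> int:
    """Each spoke has one channel to the hub and one back."""
    return 2 * n_spokes
```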




&lt;h2&gt;
  
  
  Real-World Applicability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case 1: CI/CD Pipeline Agents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Multiple agents handling build, test, deploy stages&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build-agent creates: BUILD-RESULTS.md
test-agent reads: BUILD-RESULTS.md
test-agent creates: TEST-RESULTS.md
deploy-agent reads: TEST-RESULTS.md
deploy-agent creates: DEPLOY-STATUS.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Protocol:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent writes status file&lt;/li&gt;
&lt;li&gt;Next agent in pipeline reads it&lt;/li&gt;
&lt;li&gt;Git commits track full pipeline history&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Use Case 2: Code Review Agents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Multiple specialized review agents (security, performance, style)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;security-agent creates: SECURITY-REVIEW.md
performance-agent creates: PERFORMANCE-REVIEW.md
style-agent creates: STYLE-REVIEW.md

coordinator-agent reads all three, creates: REVIEW-SUMMARY.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Protocol:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallel review agents&lt;/li&gt;
&lt;li&gt;Each writes findings to separate file&lt;/li&gt;
&lt;li&gt;Coordinator aggregates&lt;/li&gt;
&lt;/ul&gt;
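
&lt;p&gt;The coordinator step might look like this in Python (the file names come from the example above; the summary format is an assumption):&lt;/p&gt;

```python
# Sketch: the coordinator reads each specialist's review file and
# aggregates them into REVIEW-SUMMARY.md.
import tempfile
from pathlib import Path

REVIEW_FILES = ["SECURITY-REVIEW.md", "PERFORMANCE-REVIEW.md", "STYLE-REVIEW.md"]

def aggregate_reviews(workdir: Path) -> Path:
    sections = []
    for name in REVIEW_FILES:
        body = (workdir / name).read_text().strip()
        sections.append(f"## {name}\n{body}")
    summary = workdir / "REVIEW-SUMMARY.md"
    summary.write_text("# Review Summary\n\n" + "\n\n".join(sections) + "\n")
    return summary

workdir = Path(tempfile.mkdtemp())
(workdir / "SECURITY-REVIEW.md").write_text("No issues found.")
(workdir / "PERFORMANCE-REVIEW.md").write_text("One slow query flagged.")
(workdir / "STYLE-REVIEW.md").write_text("Two naming nits.")
summary_path = aggregate_reviews(workdir)
```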




&lt;h3&gt;
  
  
  Use Case 3: Documentation Agents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Agents generating API docs, tutorials, changelog&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api-doc-agent creates: docs/API.md
tutorial-agent creates: docs/TUTORIAL.md
changelog-agent creates: CHANGELOG.md

reviewer-agent creates: DOCUMENTATION-REVIEW.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Protocol:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent owns its documentation domain&lt;/li&gt;
&lt;li&gt;Reviewer checks consistency across all docs&lt;/li&gt;
&lt;li&gt;All tracked in git&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What we proved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ File-based async messaging works for multi-agent coordination&lt;/li&gt;
&lt;li&gt;✅ Git is an excellent communication bus&lt;/li&gt;
&lt;li&gt;✅ Explicit instructions beat clever assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What we learned:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iteration is required (took 4 tries to get protocol right)&lt;/li&gt;
&lt;li&gt;Redundancy helps (dual-channel: files + GitHub comments)&lt;/li&gt;
&lt;li&gt;Templates reduce ambiguity (specify exact format)&lt;/li&gt;
&lt;li&gt;Human orchestration still needed (agents don't poll automatically)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The final protocol:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Integration creates FEEDBACK-PR-X.md (detailed instructions)
2. Integration posts GitHub comment (notification)
3. Human prompts agent: "Check for feedback"
4. Agent fetches integration branch
5. Agent reads FEEDBACK-PR-X.md
6. Agent executes actions
7. Agent creates INTEGRATION-RESPONSE.md
8. Agent pushes response to their branch
9. Integration fetches and reads response
10. Integration takes next action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
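
&lt;p&gt;The ten steps above can be condensed into a file-level simulation, with directories standing in for git branches (in the real workflow, steps 4, 8, and 9 are git fetch and push):&lt;/p&gt;

```python
# Simulation of the final protocol; directories stand in for branches.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
integration = root / "integration"; integration.mkdir()
pr_branch = root / "pr-2"; pr_branch.mkdir()

# Steps 1-2: integration writes detailed feedback (the notification
# comment is posted out of band).
(integration / "FEEDBACK-PR-2.md").write_text("Run the tests and report back.")

# Steps 3-6: after a human prompt, the agent reads the feedback and acts.
instructions = (integration / "FEEDBACK-PR-2.md").read_text()

# Steps 7-8: the agent publishes its response on its own branch.
(pr_branch / "INTEGRATION-RESPONSE.md").write_text("**Status**: ALL PASSING")

# Steps 9-10: integration reads the response and decides the next action.
response = (pr_branch / "INTEGRATION-RESPONSE.md").read_text()
```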



&lt;p&gt;&lt;strong&gt;Would we do it again?&lt;/strong&gt; &lt;strong&gt;Yes!&lt;/strong&gt; File-based messaging worked well despite initial struggles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next time we'd improve:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with templates from day 1 (don't iterate to them)&lt;/li&gt;
&lt;li&gt;Add STATUS files for long-running work&lt;/li&gt;
&lt;li&gt;Implement HEARTBEAT mechanism (liveness checks)&lt;/li&gt;
&lt;li&gt;Create checklist for message format compliance&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the final article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Article 6:&lt;/strong&gt; The Budget Calculator Paradox: When Tests Don't Match Reality

&lt;ul&gt;
&lt;li&gt;Flip-flopping 8 times on a simple formula&lt;/li&gt;
&lt;li&gt;Build the calculator first, use it everywhere&lt;/li&gt;
&lt;li&gt;Cycle quantization and margin requirements&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #multi-agent #communication #protocols #git #async-messaging #coordination #distributed-systems&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 5 of the Multi-Agent Development Series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discussion:&lt;/strong&gt; How do your agents communicate? File-based, API-based, or something else? What challenges have you faced with agent coordination? Share in the comments!&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>agents</category>
      <category>ai</category>
      <category>github</category>
    </item>
  </channel>
</rss>
