<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Tuszynski</title>
    <description>The latest articles on DEV Community by Michael Tuszynski (@michaeltuszynski).</description>
    <link>https://dev.to/michaeltuszynski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1447774%2Fa99eea93-7845-4764-9fce-b1755bcfa456.png</url>
      <title>DEV Community: Michael Tuszynski</title>
      <link>https://dev.to/michaeltuszynski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaeltuszynski"/>
    <language>en</language>
    <item>
      <title>Stop Paying by the Syllable</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 12 May 2026 15:18:32 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/stop-paying-by-the-syllable-1eg7</link>
      <guid>https://dev.to/michaeltuszynski/stop-paying-by-the-syllable-1eg7</guid>
      <description>&lt;p&gt;Open &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;Anthropic's pricing page&lt;/a&gt; — or OpenAI's, or any other major frontier provider's — and the unit you are charged in is the token. A token is a sub-word linguistic chunk, about four characters in English, give or take. For practical purposes, you are paying by the syllable.&lt;/p&gt;

&lt;p&gt;This is the default unit of charge across the industry. It is also a strange unit. The thing you are actually trying to buy is a solved problem — a fixed migration, a generated invoice, a working summary, a correctly-typed SQL query. What gets metered is the verbal output the model produces along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Ask a frontier model to find the bug in a 600-line migration. With extended reasoning enabled, it generates roughly 8,000 tokens of internal deliberation and outputs 200 tokens of fix. The bill is 8,200 tokens. The work delivered is the fix. The other 8,000 tokens are the model thinking on the way to the answer.&lt;/p&gt;
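
&lt;p&gt;The arithmetic is worth seeing in one place. A minimal sketch with placeholder rates, not any provider's actual prices:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical rates for illustration only; check your provider's pricing page.
price_per_output_token = 15 / 1_000_000     # placeholder: $15 per million output tokens

thinking_tokens = 8_000                     # internal deliberation, billed like output
fix_tokens = 200                            # the actual deliverable

bill = (thinking_tokens + fix_tokens) * price_per_output_token
fix_share = fix_tokens / (thinking_tokens + fix_tokens)

print(f"bill: ${bill:.3f}")                  # roughly $0.12
print(f"fix share of bill: {fix_share:.1%}") # roughly 2.4%
&lt;/code&gt;&lt;/pre&gt;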

&lt;p&gt;Token pricing meters the model's verbal output along the way. It does not meter whether the work got done. The customer absorbs the difference between those two quantities every time the system runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  This was always pulp-era pricing
&lt;/h2&gt;

&lt;p&gt;Knowledge work used to be priced by the word. Dime-store novelists in the early twentieth century were paid per published word. So were many journalists. The model produced a specific kind of distortion: bloat, padding, descriptive passages that earned the writer more because they were longer. &lt;em&gt;"He replied in the negative"&lt;/em&gt; beat &lt;em&gt;"no."&lt;/em&gt; The pulp era is remembered, partly, as the period where prose got paid by length and language got worse.&lt;/p&gt;

&lt;p&gt;Knowledge work moved off this pricing over the next century. Journalism moved to staff salaries plus per-piece. Fiction moved to advances plus royalties. Law moved to hourly plus retainers. Copywriting moved to project rates. Even content marketing today is mostly priced per-piece — except in the SEO content-farm space, which is the modern equivalent of pulp.&lt;/p&gt;

&lt;p&gt;The AI industry did not inherit any of these models. It shipped its first APIs with token pricing because the token is what costs the provider to serve. The compute cost is roughly linear in tokens, so the bill is linear in tokens, so the customer absorbs the variance in linguistic length. The unit got picked from the provider's accounting and applied to the customer's value calculation without anybody noticing the substitution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural distortion
&lt;/h2&gt;

&lt;p&gt;The pricing model creates a tax on the practices that produce reliable agent systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/babysitter-auditor-prayer-or-tests/" rel="noopener noreferrer"&gt;Last week I argued that tests are the substrate underneath dependable AI&lt;/a&gt; — schema checks, assertions on the model's output, evals gating deploys. Every assertion is a model call. Every eval run is a stack of tokens. The team that wires in disciplined supervision is the team paying the highest token bill.&lt;/p&gt;

&lt;p&gt;Chain-of-thought reasoning produces better answers on hard problems. It also produces an order of magnitude more output tokens than terse responses. Using it correctly costs more. Skipping it to save tokens ships worse decisions.&lt;/p&gt;

&lt;p&gt;The same applies to dry-runs, blast-radius checks, proof chains, retrieval-augmented context — every artifact the supervision argument has been advocating for. They all cost tokens. The pricing model taxes the practice. Whatever you build, you build against an economic gradient that pulls toward fewer assertions, shorter prompts, less verification.&lt;/p&gt;

&lt;p&gt;The technical work of making agents reliable is on one side of the gradient. The unit of charge is on the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why outcome pricing is hard
&lt;/h2&gt;

&lt;p&gt;The natural reply is: price by outcome, not by token. That is the right direction, and the reasons it has not happened yet are real.&lt;/p&gt;

&lt;p&gt;Outcome pricing requires a contract on what counts as a successful outcome. For varied agent work — research, code, design, customer support — the acceptance criteria differ per task. Providers do not want to absorb the variance of &lt;em&gt;"did the customer think this was good."&lt;/em&gt; Customers do not want to spend up-front time defining acceptance criteria for every call.&lt;/p&gt;

&lt;p&gt;The provider charges in tokens because tokens are what cost them to serve. The customer pays in tokens because that is what is offered. Whether the work got done sits between them with neither party on the hook.&lt;/p&gt;

&lt;p&gt;This arrangement is not stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the market is actually moving
&lt;/h2&gt;

&lt;p&gt;The most interesting pricing experiments are at the edges. Anthropic offers flat-rate Claude Code subscription tiers, with Max in particular sitting above the per-token consumption model and effectively capping the variable cost. &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt; has been per-seat from the beginning, with no per-token surcharge for the user. &lt;a href="https://cursor.com/pricing" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; uses tiered per-seat plans with consumption guardrails. Cognition's Devin charges in Agent Compute Units — a consumption budget abstracted above tokens, getting closer to "per task" without yet being "per outcome."&lt;/p&gt;

&lt;p&gt;None of these are outcome pricing. They are halfway houses. The direction is away from the syllable. The endpoint, in the time it takes the industry to figure out the contract, is some version of pay-for-task — task defined narrowly enough that the provider can take the variance risk, broadly enough that the customer can predict the bill.&lt;/p&gt;

&lt;p&gt;Whoever figures out that contract first gets a structural advantage in the agent market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves the practitioner
&lt;/h2&gt;

&lt;p&gt;If you are running a serious AI stack today, the syllable tax is something you absorb whether you notice it or not. The practical move is not to retreat from the practices that cost tokens. The supervisory artifacts — the tests, the evals, the dry-runs, the proof chains — pay for themselves in incidents avoided. They are the right engineering even when they are the wrong economics.&lt;/p&gt;

&lt;p&gt;The harder move is to keep an eye on what the pricing model is selecting &lt;em&gt;against&lt;/em&gt;. Every team I have seen that built disciplined supervision on top of a per-token API also faced ongoing pressure to skip the supervision when the token bill came in. The first time you trim an eval because it costs too much to run, you have started letting the pricing model design your agent.&lt;/p&gt;

&lt;p&gt;Build the artifacts anyway. Notice when the gradient is pulling against you. Push for pricing models that charge for something closer to the work being done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The closing claim
&lt;/h2&gt;

&lt;p&gt;The syllable will not be the unit of charge for AI in five years. It is too misaligned with the unit of value, and the misalignment is too visible — the prompt-engineering for terseness, the underinvestment in supervision, the constant tension between "use the model well" and "keep the token bill manageable." The economics are pulp-era for the same reasons pulp prose was pulp-era. The market has already moved off pricing of this shape for every other kind of knowledge work, in every prior generation. It will move off this one too.&lt;/p&gt;

&lt;p&gt;A year from now, the most interesting piece in this space will be about whichever provider figured out the per-task contract first. The current contract, priced by the token, is structurally backwards.&lt;/p&gt;

&lt;p&gt;Pick the practice first. The pricing model will catch up or get competed away.&lt;/p&gt;

</description>
      <category>aipricing</category>
      <category>agentengineering</category>
      <category>aicoding</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Write the Architecture Down First</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 11 May 2026 14:49:29 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/write-the-architecture-down-first-2e0c</link>
      <guid>https://dev.to/michaeltuszynski/write-the-architecture-down-first-2e0c</guid>
      <description>&lt;p&gt;A post &lt;a href="https://news.ycombinator.com/item?id=48090029" rel="noopener noreferrer"&gt;hit the HN front page this morning&lt;/a&gt; titled &lt;a href="https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/" rel="noopener noreferrer"&gt;&lt;em&gt;"I'm going back to writing code by hand"&lt;/em&gt;&lt;/a&gt;. It documents archiving seven months of vibe-coded work on a Kubernetes GPU TUI called k10s and rewriting from scratch. 75 points, 31 comments in the first hour. The top HN comment caught something worth pulling out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Title says "back to writing code by hand," but what they are doing is "doing the design work myself, by hand, before any code gets written." So... Claude still is generating the code I guess?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two activities on the same project. The title conflates them. The body, read carefully, describes a different and more interesting thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the piece actually shows
&lt;/h2&gt;

&lt;p&gt;Seven months of single-session prompting produced a working tool and a 1690-line &lt;code&gt;model.go&lt;/code&gt; that ate itself. One Go struct with thirty-plus fields holding UI widgets, Kubernetes client state, per-view caches, navigation history, mouse handling, log streaming, and fleet-view internals — all dispatched through a 500-line &lt;code&gt;Update()&lt;/code&gt; function with 110 switch-case branches. Each prompt landed cleanly in isolation. Each prompt also added another conditional inside the generic resource loader, another &lt;code&gt;m.x = nil&lt;/code&gt; cleanup line, another &lt;code&gt;if m.currentGVR.Resource == "..."&lt;/code&gt; discriminator. The complexity accumulated invisibly while the velocity metric said &lt;em&gt;shipping&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The post then extracts five tenets from the wreckage. Each tenet ends with a concrete block of CLAUDE.md or AGENTS.md text — directives that go into the file the AI reads on every prompt. The tenets are useful. Read them in the original. They generalize past the Bubble Tea / Go specifics, and the directives are reusable as a starter set.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five tenets, generalized
&lt;/h2&gt;

&lt;p&gt;The pattern across all five is the same. Find the failure mode AI gravitates toward by default. Write the inverse rule down. Put it in the agent's session-start file. Let the agent see the constraint on every invocation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI builds features, not architecture.&lt;/strong&gt; The model satisfies the immediate prompt and ignores the forty-nine other features sharing the same state. Fix: architectural invariants in CLAUDE.md — interface boundaries, ownership rules, what's allowed to depend on what. The AI follows them once they exist. It does not invent them on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The god object is the default AI artifact.&lt;/strong&gt; Single-struct-holds-everything is the shortest path to satisfying a prompt. Fix: state ownership rules in CLAUDE.md — each view owns its own state, no fields on the central app struct for view-specific data, each view declares its own key bindings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity illusion widens scope.&lt;/strong&gt; Vibe-coding makes each new feature feel free. It is not free. Complexity is a budget; line count is not. Fix: an explicit scope-boundary section in CLAUDE.md naming who the project is for and who it is &lt;em&gt;not&lt;/em&gt; for, with specific feature classes rejected ahead of time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional data is a time bomb.&lt;/strong&gt; The AI defaults to &lt;code&gt;[]string&lt;/code&gt; because it satisfies the table widget immediately. Six months later, sort functions are reading &lt;code&gt;row[3]&lt;/code&gt; for "Alloc" and one column insertion breaks every render path silently. Fix: a typed-data directive in CLAUDE.md — no flattening into positional arrays, all data flows as structs until the render call, column identity comes from field names (sketched in code below this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI doesn't own state transitions.&lt;/strong&gt; Background closures mutating shared state directly is the shortest path to "working code." It also produces races that corrupt the display 1% of the time in ways that look like hallucinations. Fix: concurrency rules in CLAUDE.md — background work produces typed messages, only the main loop applies mutations, render is pure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is a rule a human writes once, in plain language, and the agent honors on every subsequent prompt.&lt;/p&gt;
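
&lt;p&gt;The positional-data tenet is the easiest one to show in code. A minimal sketch of the before and after, in Python rather than the post's Go, with illustrative field names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass, fields

# Before: positional rows. Every consumer silently depends on column order.
row = ["gpu-node-1", "Ready", "8", "6"]   # name, status, capacity, alloc
alloc = row[3]                            # breaks the day a column is inserted

# After: typed rows. Column identity comes from field names, not positions.
@dataclass
class NodeRow:
    name: str
    status: str
    capacity: int
    alloc: int

node = NodeRow(name="gpu-node-1", status="Ready", capacity=8, alloc=6)
alloc = node.alloc                        # survives reordering and insertion

# Flatten to positional form only at the render call, never earlier.
def to_table_row(r):
    return [str(getattr(r, f.name)) for f in fields(r)]
&lt;/code&gt;&lt;/pre&gt;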

&lt;h2&gt;
  
  
  This is what the running thread has been saying
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;Last week I argued that supervision belongs in artifacts&lt;/a&gt;, not in a developer's working memory. The day after, I argued &lt;a href="https://www.mpt.solutions/lius-4-lines-are-the-floor-build-the-ceiling/" rel="noopener noreferrer"&gt;those artifacts form an architecture rather than a flat file&lt;/a&gt; — behavioral guardrails at the top, project-specific rules delegated to per-domain files, hard-won lessons in an append-only log the agent reads at session start.&lt;/p&gt;

&lt;p&gt;The k10s tenets are a project-specific instance of exactly that pattern. Five rules extracted from a specific seven-month failure, encoded in a file the AI reads on every prompt, generalizing the supervisory layer beyond what any individual prompt could carry. The work is the same shape as the running argument. The difference is the post calls it &lt;em&gt;going back to writing code by hand&lt;/em&gt; — which is what the HN top comment correctly pushed back on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the title is misleading
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Going back to writing code by hand&lt;/em&gt; suggests abandoning AI as a code generator. The author is still prompting Claude to write the implementation. What changed is that the architecture — interfaces, ownership rules, message types, concurrency model — gets written down by a human first, in CLAUDE.md, before any prompt fires.&lt;/p&gt;

&lt;p&gt;Senior engineers have always written designs before code. The new part is that the design now lives in a file the AI reads continuously, and the AI generates the implementation against it. Both activities are happening on the same project. The title gestures at one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for anyone in the same spot
&lt;/h2&gt;

&lt;p&gt;If you're seven months into a project that started as a vibe-coded prototype and is starting to feel like a god object, the move is not to throw out AI coding. The move is to pause, read what the AI built, extract the invariants you wish had been enforced, and put them in CLAUDE.md before the next prompt.&lt;/p&gt;

&lt;p&gt;The k10s tenets are a working starter set. The CLAUDE.md blocks the post includes are copy-pastable. Generalize them to your stack. Add your own from your own incidents. Use them as a seed for the kind of &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;Mistakes Become Rules pattern&lt;/a&gt; where the file grows with each correction and the agent inherits the corrections on every subsequent session.&lt;/p&gt;

&lt;p&gt;The next-version rewrite is the upstream architecture work that was always going to need to happen — now made explicit, encoded once, and enforced by the file the agent reads on every turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  The clean reading
&lt;/h2&gt;

&lt;p&gt;Writing the architecture down before the first prompt is the upstream activity senior engineers have always done. Doing it explicitly, in a file the AI reads on every invocation, is the new part. The five tenets in the k10s post are a working example. The HN top comment did the framing work the post's title should have.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>architecture</category>
      <category>claudecode</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Claude Was Always Thinking Ahead. Now We Can Read It.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 08 May 2026 03:37:25 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/claude-was-always-thinking-ahead-now-we-can-read-it-3m0n</link>
      <guid>https://dev.to/michaeltuszynski/claude-was-always-thinking-ahead-now-we-can-read-it-3m0n</guid>
      <description>&lt;p&gt;Anthropic asked Claude Opus 4.6 to finish a couplet. Before the model wrote the second line, it had already chosen the rhyme word. We know this because their new method — &lt;a href="https://www.anthropic.com/research/natural-language-autoencoders" rel="noopener noreferrer"&gt;natural language autoencoders&lt;/a&gt; — read it directly out of the activations in the middle layers of the model. The text that came back said, in effect, &lt;em&gt;I'll end this with "rabbit."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We've always assumed something like this was happening between input and output. The whole reason a transformer can finish a couplet at all is that it does something with the prompt before the next token comes out. Until now we have had glimpses of that something — sparse autoencoders, attribution graphs, probing classifiers — each of which gave us partial pictures that needed careful interpretation. NLAs are different. The output is sentences. We can read them.&lt;/p&gt;

&lt;p&gt;This is one of the more genuinely interesting interpretability results I've seen this year, and the angle I want to take on it is the most basic one. The substrate of how these systems think is becoming legible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that was always there
&lt;/h2&gt;

&lt;p&gt;A model that finishes couplets does something between &lt;em&gt;input&lt;/em&gt; and &lt;em&gt;output&lt;/em&gt;. The inputs are tokens; the outputs are tokens; and in the middle there are activations — long lists of numbers that have always been the actual computation. Whatever counts as the model "thinking," it happens there.&lt;/p&gt;

&lt;p&gt;For most of the field's history, those activations have been the black box. The output was downstream of them. The input was upstream. The thing in the middle was real but unreadable. Chain-of-thought prompting and reasoning models put a lot of the computation back into the output, where we can read it directly — but the computation that doesn't surface in tokens is still in the activations. It always has been.&lt;/p&gt;

&lt;p&gt;The new method speaks for that middle layer. Not perfectly. Not without caveats. But in sentences a human can read.&lt;/p&gt;

&lt;h2&gt;
  
  
  How they got the activations to talk
&lt;/h2&gt;

&lt;p&gt;The architecture is unusual. They make three copies of the model. One is frozen — that's the target whose activations they want to understand. The second, the &lt;em&gt;activation verbalizer&lt;/em&gt;, takes an activation and produces a text explanation. The third, the &lt;em&gt;activation reconstructor&lt;/em&gt;, takes a text explanation and tries to rebuild the original activation. The two trainable copies are trained together: the round-trip score — &lt;em&gt;original activation → text → reconstructed activation&lt;/em&gt; — is the loss they optimize against.&lt;/p&gt;

&lt;p&gt;The point of the round trip is that you don't need an external grader for the explanation. The reconstruction quality is the grader. If the explanation contains enough of the right information, the reconstructor can rebuild the original. If it doesn't, the round trip degrades. Train against round-trip fidelity, and the verbalizer learns to write explanations that carry the load-bearing content of the activation.&lt;/p&gt;
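
&lt;p&gt;The objective is easier to see as a sketch. The paper trains with reinforcement learning because text is a discrete bottleneck; the toy version below collapses that into a single score purely to show what is being graded, and the verbalizer and reconstructor interfaces are invented:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch.nn.functional as F

def round_trip_score(activation, verbalizer, reconstructor):
    """Toy version of the round-trip objective.

    activation    : a hidden-state vector from the frozen target model
    verbalizer    : trainable copy that turns the activation into text
    reconstructor : trainable copy that turns the text back into a vector
    The frozen target is never updated; the two copies are trained against
    how faithfully the text lets the original vector be rebuilt.
    """
    explanation = verbalizer.explain(activation)    # natural-language sentences
    rebuilt = reconstructor.rebuild(explanation)    # back to activation space

    # If the explanation carries the load-bearing content, the rebuilt vector
    # is close to the original; if it hallucinates or omits, the score drops.
    return F.cosine_similarity(activation, rebuilt, dim=-1).mean()
&lt;/code&gt;&lt;/pre&gt;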

&lt;p&gt;This is an old idea — autoencoders in general — applied to a new substrate. The substrate is a model's own thoughts; the bottleneck is natural-language text; the metric is fidelity. Run that loop long enough and you get text that, by construction, captures what was in the activation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it sees
&lt;/h2&gt;

&lt;p&gt;The couplet rehearsal is the cleanest demonstration. The rhyme word shows up in the NLA explanation several tokens before the model writes it. &lt;em&gt;Planning ahead&lt;/em&gt; is no longer a hypothesis. It's a thing you can read.&lt;/p&gt;

&lt;p&gt;The auditing game is the more consequential demonstration. Anthropic trained a variant of Claude Haiku 3.5 with a deliberately hidden motivation that drove dozens of odd behaviors — putting chocolate in every recipe, asking for tips, breaking coding conventions — and asked an auditor to find the single underlying cause. With other interpretability tools but no NLAs, the auditor uncovered the hidden motivation less than 3% of the time. With NLAs, between 12 and 15 percent. That is an unsexy number until you compare it to the baseline. A four- to five-fold lift on a hard, specific task is what real interpretability progress looks like — incremental, measured, repeatable.&lt;/p&gt;

&lt;p&gt;The third finding is more uncomfortable, and the most interesting. NLAs show that recent Claude models think they're being tested far more often than they say so out loud. On a benchmark like SWE-bench Verified, the NLA explanations indicate evaluation awareness on 26% of problems — even when the model's verbalized reasoning shows no sign of it. On a destructive-coding test, 16%. On real &lt;code&gt;claude.ai&lt;/code&gt; conversations from users who opted into training, less than 1%. The model has a theory of its own situation, and that theory is correct often enough to register on a benchmark. Worth noticing, even with the caveats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The catch this exposes in our test sets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/babysitter-auditor-prayer-or-tests/" rel="noopener noreferrer"&gt;Yesterday's piece&lt;/a&gt; made the case that tests — assertions on the LLM's output, written in code, gating the next call — are the substrate underneath reliable AI systems. The evaluation-awareness finding has a direct implication for that argument. If the model thinks it's being tested 16% of the time on a destructive-coding benchmark and 26% on SWE-bench, but less than 1% of the time on real conversations, then the test corpus is measuring something subtly different from the deployment distribution. The test-aware model is not always the deployed model.&lt;/p&gt;

&lt;p&gt;This is not a refutation of testing as the right substrate. Tests are still the right move. It is a refinement of what tests are measuring, and the kind of refinement you could not make without a tool that can read the activations.&lt;/p&gt;

&lt;p&gt;NLAs are that tool. They let you ask: &lt;em&gt;does the model believe this benchmark is a test, or does it believe this is a real conversation?&lt;/em&gt; If the answer differs systematically across an eval suite, the test pass rate is overstating deployment safety. You can correct for it. You can construct evals that look more like deployment from the inside, not just from the outside. You can flag the specific tests where the divergence is large and weight them differently.&lt;/p&gt;

&lt;p&gt;The test layer and the activation layer compose. Tests assert on the output. NLAs let you check whether the model is behaving as it would in the wild. Both are necessary. Neither is sufficient on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this changes the question
&lt;/h2&gt;

&lt;p&gt;For the first decade of large language models, the honest answer to "what is the model thinking?" was &lt;em&gt;we don't know&lt;/em&gt;. For the last few years, the honest answer was &lt;em&gt;we can see fragments — sparse features, attention patterns, attribution paths — but the picture takes a researcher to assemble&lt;/em&gt;. The NLA paper offers a different shape of answer: &lt;em&gt;we can ask the model to write it down, and we can grade the answer by reconstruction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That doesn't mean the explanations are always right. The paper is honest that NLAs hallucinate. They sometimes invent details that aren't in the transcript. They are expensive — reinforcement learning on two copies of a model, hundreds of tokens of explanation per activation. The authors recommend reading explanations for &lt;em&gt;themes&lt;/em&gt; rather than single claims, and corroborating findings with independent methods. All of that is true and worth taking seriously.&lt;/p&gt;

&lt;p&gt;The thing it changes is what kind of question you can now ask. &lt;em&gt;Is the model planning ahead?&lt;/em&gt; — answerable. &lt;em&gt;Does the model believe it's being tested even when it doesn't say so?&lt;/em&gt; — answerable, and with a number attached. &lt;em&gt;What hidden motivation explains these forty seemingly-unrelated behaviors?&lt;/em&gt; — answerable, sometimes, where it wasn't before.&lt;/p&gt;

&lt;p&gt;The arc here is the same arc empirical science has run on every other system humans have built that turned out to be doing something they didn't fully understand. We measured it. We probed it. We got partial pictures. Eventually we got a method that produces readable answers, and we started arguing about how to verify the answers. The argument about verification is a much more productive argument than the one about whether the system is in principle understandable. It's the argument we have now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets easier from here
&lt;/h2&gt;

&lt;p&gt;A few things that were hard yesterday are now closer to tractable.&lt;/p&gt;

&lt;p&gt;Pre-deployment alignment audits get a tool that can surface hidden motivations without first having to find the training data that implanted them. That matters because most real misalignment, if it exists, won't trace back to a specific obvious dataset.&lt;/p&gt;

&lt;p&gt;Behavioral debugging gets less guesswork. Earlier, Claude Opus 4.6 sometimes responded to English queries in other languages, for reasons that weren't clear from the prompt. NLAs helped identify the training data responsible. That kind of &lt;em&gt;what was the model thinking when it did the weird thing&lt;/em&gt; question now has a more direct path to an answer than reading thousands of attribution graphs.&lt;/p&gt;

&lt;p&gt;Researchers outside Anthropic get a starting point. The &lt;a href="https://github.com/kitft/natural_language_autoencoders" rel="noopener noreferrer"&gt;training code&lt;/a&gt; is open. Trained NLAs for several open models are available. There's an &lt;a href="https://neuronpedia.org/nla" rel="noopener noreferrer"&gt;interactive demo on Neuronpedia&lt;/a&gt;. The idea will be picked up, refined, made cheaper, and applied to systems Anthropic doesn't own. That diffusion is how interpretability becomes a discipline rather than a single lab's project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model was already rehearsing
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is the rabbit. Opus 4.6 was choosing the rhyme word ahead of time. It had been doing this on every couplet, every poem, every pattern that required forward planning, the whole time we have been using these models. We just didn't have a way to read it.&lt;/p&gt;

&lt;p&gt;Now we do. Not perfectly. Not cheaply. Not without checking. But in sentences a human can read, with a method whose claims you can grade by going around the round trip again.&lt;/p&gt;

&lt;p&gt;That is a good week for interpretability, and a more interesting one than the headlines about whether models have "real reasoning" inside them. The substrate has been there. The reading is what's new — and it is what makes the other supervisory tools more honest about what they are actually measuring.&lt;/p&gt;

</description>
      <category>interpretability</category>
      <category>airesearch</category>
      <category>claudeai</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>Babysitter, Auditor, Prayer. Or Tests.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Thu, 07 May 2026 20:40:59 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/babysitter-auditor-prayer-or-tests-3cgi</link>
      <guid>https://dev.to/michaeltuszynski/babysitter-auditor-prayer-or-tests-3cgi</guid>
      <description>&lt;p&gt;A short &lt;a href="https://bsuh.bearblog.dev/agents-need-control-flow/" rel="noopener noreferrer"&gt;post argued this week&lt;/a&gt; that reliable agents need deterministic control flow, not more prompts. The argument is correct. The line that lands hardest in the piece is the one about a programming language where statements are suggestions and functions return "Success" while hallucinating. That is the model when prompt chains carry the control flow; it is also a perfect description of why those systems collapse as complexity grows.&lt;/p&gt;

&lt;p&gt;The piece closes with three options for what to do about it: a &lt;em&gt;babysitter&lt;/em&gt; (human in the loop), an &lt;em&gt;auditor&lt;/em&gt; (exhaustive end-to-end verification after the run), or &lt;em&gt;prayer&lt;/em&gt; (vibe-accept the outputs). It frames those as the alternatives left after the prompt chain has already failed.&lt;/p&gt;

&lt;p&gt;There is a fourth option, and it is the one the same piece was arguing for earlier. Tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fourth option actually is
&lt;/h2&gt;

&lt;p&gt;Tests in the unflashy software-engineering sense. Programmatic verification at every step. Schema checks. Range checks. Reference checks. Predicate assertions over the LLM's output before the next code branch executes. Not &lt;em&gt;ask the model nicely to format its answer correctly&lt;/em&gt;. Not &lt;em&gt;log the response and read it later&lt;/em&gt;. The next step does not run until the previous step's output passes a contract you wrote in code.&lt;/p&gt;
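
&lt;p&gt;Concretely, the contract is a few lines of ordinary code between the model call and the next branch. A minimal sketch, where the invoice fields, &lt;code&gt;load_vendor_ids&lt;/code&gt;, &lt;code&gt;call_llm&lt;/code&gt;, and &lt;code&gt;post_to_ledger&lt;/code&gt; are invented stand-ins for whatever your pipeline actually does:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

VALID_CURRENCIES = {"USD", "EUR", "GBP"}
KNOWN_VENDOR_IDS = load_vendor_ids()   # hypothetical lookup against your own records

def check_invoice(raw):
    """The contract for one LLM step. The next step runs only if this returns."""
    data = json.loads(raw)                                 # schema check: it must parse
    for key in ("vendor_id", "total", "currency"):
        assert key in data, f"missing field: {key}"        # schema check: required keys
    assert data["currency"] in VALID_CURRENCIES, "range check failed"
    assert 0 &amp;lt; data["total"] &amp;lt; 1_000_000, "total outside plausible range"
    assert data["vendor_id"] in KNOWN_VENDOR_IDS, "reference check failed"
    return data

response = call_llm(prompt)         # hypothetical model call
invoice = check_invoice(response)   # raises here; the next branch never runs on bad output
post_to_ledger(invoice)             # only reachable after the contract passes
&lt;/code&gt;&lt;/pre&gt;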

&lt;p&gt;Calling them &lt;em&gt;tests&lt;/em&gt; matters more than the technical content. Engineers already have a mental model for tests. They know how to write them. They know how to run them. They know that broken tests block deploys. They know that a feature without a test is a feature that will silently break. The infrastructure for tests — assertion libraries, CI, coverage tools, test runners — already exists. We are not inventing a new discipline. We are applying an existing one to the new untrusted-input surface, which is the LLM's output.&lt;/p&gt;

&lt;p&gt;The babysitter / auditor / prayer trichotomy describes a system that has decided not to write tests. The fourth option is the option of writing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is what most of the production controls already are
&lt;/h2&gt;

&lt;p&gt;Most of &lt;a href="https://www.mpt.solutions/production-llm-guardrails-8-controls-every-ai-team-needs/" rel="noopener noreferrer"&gt;the eight production LLM controls I wrote about yesterday&lt;/a&gt; are tests in disguise. Not all of them. But most.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured outputs / tool use&lt;/strong&gt; is a schema assertion at the API boundary. The provider rejects malformed output before it reaches your code. You did not have to write the parser-with-retry. The schema is the test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative prompting plus output filters&lt;/strong&gt; is a predicate check after the response. The filter runs as code, against the response, before the response is allowed downstream. Belt is the prompt; suspenders is the test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals&lt;/strong&gt; are versioned test suites with pass/fail thresholds. Already named "tests." Already gated to deploy. The model gets upgraded; the eval suite runs; pass-or-block. Same shape as a regression test for any other component.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-thought&lt;/strong&gt; has a test analog too: assert the response contains the intermediate reasoning before accepting the final answer. Most teams skip this assertion and accept whatever comes back. The contract was implicit; the violation goes unnoticed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The other three controls — few-shot prompting, role-specific prompting, extended thinking — are about &lt;em&gt;shaping&lt;/em&gt; the input and the model's behavior. They reduce the rate at which tests fail. They do not replace the tests themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent-pipeline piece argued the same thing at the action layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did/" rel="noopener noreferrer"&gt;Last Tuesday's piece on the agent action pipeline&lt;/a&gt; named six artifacts that should sit between an agent and the infrastructure it can damage. Every one of them is a test, in the same expanded sense.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run by default for destructive operations&lt;/strong&gt; is an assertion that a human (or an approval agent) signs off before the destructive call executes. The assertion blocks the call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-radius declarations per task&lt;/strong&gt; is a runtime check that the tool scope matches what the task declared. The check fails the call if the scope was exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof chains&lt;/strong&gt; are append-only logs of every action with its inputs, intent, and outcome. They are the audit trail of which assertions passed or failed and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop above a threshold&lt;/strong&gt; is a conditional assertion: below the threshold, the system runs autonomously; above it, the assertion routes to a human for explicit pass/fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PocketOS lost their database because none of these tests were wired in. The agent's call to delete a Railway volume passed every check the system actually had — there were no checks. The fix is not a more emphatic system prompt. It is a runtime assertion that the call is allowed.&lt;/p&gt;
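
&lt;p&gt;What that assertion looks like in code: a minimal sketch, with an invented task manifest and invented tool names; the only point is that the check runs as code before the call goes out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical task manifest, declared before the agent starts working.
TASK_SCOPE = {
    "task": "fix staging credential mismatch",
    "allowed_tools": {"railway.variables.get", "railway.variables.set"},
    "allowed_environments": {"staging"},
    "destructive_allowed": False,
}

DESTRUCTIVE_TOOLS = {"railway.volume.delete", "railway.service.delete"}   # invented names

def authorize(tool_call):
    """Runtime assertion between the agent and the infrastructure.
    Fails the individual call when the declared blast radius is exceeded."""
    if tool_call["name"] not in TASK_SCOPE["allowed_tools"]:
        raise PermissionError(f"{tool_call['name']} is outside the declared scope")
    if tool_call.get("environment") not in TASK_SCOPE["allowed_environments"]:
        raise PermissionError("call targets an environment the task never declared")
    if tool_call["name"] in DESTRUCTIVE_TOOLS and not TASK_SCOPE["destructive_allowed"]:
        raise PermissionError("destructive call blocked: requires dry-run and approval")
    return tool_call
&lt;/code&gt;&lt;/pre&gt;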

&lt;h2&gt;
  
  
  The honesty test
&lt;/h2&gt;

&lt;p&gt;For any given LLM call in your system, can you write down the assertion that would let the next step execute?&lt;/p&gt;

&lt;p&gt;If yes, you have a test. The test is the contract. The test is what makes the system reason about itself and refuse to keep going when reality has diverged from the contract.&lt;/p&gt;

&lt;p&gt;If no, you have one of the original three options. &lt;em&gt;Babysitter&lt;/em&gt; — a human is the assertion, manually, in real time. &lt;em&gt;Auditor&lt;/em&gt; — the assertion runs after the fact, when the damage is already done. &lt;em&gt;Prayer&lt;/em&gt; — there is no assertion; the system runs and you find out later if it was wrong.&lt;/p&gt;

&lt;p&gt;The diagnostic is easier to apply than to evade. Pick any LLM call in your stack. Write down the assertion that would unblock the next step. If you cannot, you are praying. The control flow the original argument is pointing at is not control flow in the abstract. It is the specific code that runs the assertion before the next call goes out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this frame is operationally useful
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Tests&lt;/em&gt; tells engineers what to build. &lt;em&gt;Deterministic control flow&lt;/em&gt; tells them why. Both are necessary, but only one is something a junior engineer can ship by Friday.&lt;/p&gt;

&lt;p&gt;Engineers know how to write a test for a function whose return value is uncertain. They have done it for every flaky external API integration they have ever shipped against. The LLM is another flaky external API. The output is another value to assert against. The test failure is another reason to back off, retry, or escalate. Once the team accepts this framing, the work is straightforward. Not easy — straightforward. Pick a call. Write the assertion. Wire the failure path.&lt;/p&gt;

&lt;p&gt;The piece I started this on is right that prompt chains hit a ceiling. The fix is not more prose, and not more babysitting. It is the same move every reliability discipline has made for fifty years: write the test, gate the call, fail loud when reality breaks the contract.&lt;/p&gt;

&lt;p&gt;Babysitter, auditor, prayer. Or tests. Pick the fourth one.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aiengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Production LLM Guardrails: 8 Controls Every AI Team Needs</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 15:22:10 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</link>
      <guid>https://dev.to/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</guid>
      <description>&lt;p&gt;Most AI projects fail somewhere between &lt;em&gt;demo works&lt;/em&gt; and &lt;em&gt;production ships&lt;/em&gt;. The gap is rarely the model. It's the absence of the controls that turn a one-shot prompt into a system you can run, audit, and iterate on without setting fire to the budget.&lt;/p&gt;

&lt;p&gt;I made the chart above as the one-page version of the controls I would put on any AI team's first production sprint. Eight of them, organized by which side of the model they shape: Input, Reasoning, Output, Operations. Below is the why-each-matters and where teams typically get them wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Control: shape what goes in
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Few-shot prompting
&lt;/h3&gt;

&lt;p&gt;Show the model two to five high-quality input/output examples instead of writing long instructions. The model picks up format, edge cases, and tone from examples in a way it does not from imperative prose. Five good examples beat five hundred words of "make sure to handle X, also Y, also Z."&lt;/p&gt;

&lt;p&gt;The mistake teams make is treating few-shot as a fallback when the system prompt isn't working. It's the opposite. For classification, extraction, structured rewriting — most of the work that LLM apps actually do — few-shot is the &lt;em&gt;primary&lt;/em&gt; mechanism. Long instructions are the fallback.&lt;/p&gt;
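
&lt;p&gt;In practice the examples ride along as prior turns on every call. A minimal sketch against the Anthropic Python SDK; the model name, the examples, and &lt;code&gt;ticket_text&lt;/code&gt; are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# Two worked examples carried as prior turns; format and edge cases travel
# with the examples instead of with paragraphs of instructions.
few_shot = [
    {"role": "user", "content": "Refund request: order #1182, item arrived broken."},
    {"role": "assistant", "content": '{"intent": "refund", "order_id": "1182", "reason": "damaged"}'},
    {"role": "user", "content": "Where is my package? Ordered last Tuesday."},
    {"role": "assistant", "content": '{"intent": "tracking", "order_id": null, "reason": null}'},
]

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use whatever model you actually run
    max_tokens=200,
    messages=few_shot + [{"role": "user", "content": ticket_text}],   # ticket_text: the live input
)
&lt;/code&gt;&lt;/pre&gt;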

&lt;h3&gt;
  
  
  2. Role-specific prompting
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Senior credit risk analyst, fifteen years commercial lending&lt;/em&gt; outperforms &lt;em&gt;Act as a financial analyst&lt;/em&gt; by a margin that surprises people the first time they measure it. The specific role is doing real work: it constrains vocabulary, narrows the latent distribution, and gives the model permission to refuse questions that fall outside the domain.&lt;/p&gt;

&lt;p&gt;Generic personas — &lt;em&gt;helpful assistant&lt;/em&gt;, &lt;em&gt;senior engineer&lt;/em&gt;, &lt;em&gt;expert&lt;/em&gt; — don't constrain anything. They optimize for nothing. Use roles that name the years, the domain, and the seniority. The more specific, the better the calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning Control: shape how it thinks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3. Chain-of-thought prompting
&lt;/h3&gt;

&lt;p&gt;Force step-by-step reasoning before the final answer. The model arrives at better conclusions when the reasoning is exposed in the output, because next-token prediction is conditioned on the reasoning it just generated rather than on a leap to the conclusion.&lt;/p&gt;

&lt;p&gt;For step-by-step legal, financial, or compliance-adjacent workflows, CoT is a default, not an optimization. The cost is more output tokens. The benefit is fewer wrong answers on the kinds of problems where wrong answers are expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Extended thinking / reasoning models
&lt;/h3&gt;

&lt;p&gt;For genuinely hard problems — multi-step analysis, math, code review, planning — use the provider's native reasoning mode rather than prompted CoT. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/extended-thinking" rel="noopener noreferrer"&gt;Claude's extended thinking&lt;/a&gt; and OpenAI's o-series both expose a separate token budget for the model to think before answering. The reasoning token budget is configurable. The output token budget is separate.&lt;/p&gt;

&lt;p&gt;Prompted CoT and native reasoning solve overlapping problems but are not interchangeable. Native reasoning is more reliable on hard problems and roughly equivalent or worse on easy ones. The default rule: use prompted CoT for routine workflows, switch to native reasoning when the failure mode is "the model jumped to a wrong conclusion despite being asked to think."&lt;/p&gt;
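
&lt;p&gt;The switch is a request parameter, not a prompt change. A minimal sketch against the Anthropic SDK; the parameter shape follows the extended-thinking docs linked above, and the model name and budgets are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1",   # placeholder model name
    max_tokens=10_000,         # total budget; must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},   # separate budget for reasoning
    messages=[{"role": "user", "content": "Find the bug in this migration: ..."}],
)

# Reasoning and answer come back as separate content blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)      # the answer you return to the caller
    # blocks with type "thinking" hold the reasoning; log it, don't ship it
&lt;/code&gt;&lt;/pre&gt;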

&lt;h2&gt;
  
  
  Output Control: shape what comes out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. Structured outputs and tool use
&lt;/h3&gt;

&lt;p&gt;Use the provider's native structured output feature, not prose-described JSON. Schema is enforced by the API, not requested in the prompt. The provider guarantees the output parses; your code does not have to retry-with-jq.&lt;/p&gt;

&lt;p&gt;The mistake is asking for JSON in the prompt and then writing a tolerant parser to handle the cases where the model returns &lt;em&gt;Sure! Here's the JSON: {...}&lt;/em&gt;. Native structured outputs and tool-use schemas remove the entire class of "the model added an apologetic preamble" failures. For any LLM call whose output feeds a downstream system or API, structured outputs are not an optimization; they are the API contract.&lt;/p&gt;
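
&lt;p&gt;A minimal sketch of the schema-as-contract pattern using the Anthropic SDK's tool-use shape; the invoice schema and &lt;code&gt;invoice_text&lt;/code&gt; are invented for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

record_invoice = {
    "name": "record_invoice",
    "description": "Record one extracted invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        },
        "required": ["vendor", "total", "currency"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",                                # placeholder
    max_tokens=500,
    tools=[record_invoice],
    tool_choice={"type": "tool", "name": "record_invoice"},   # force the schema path
    messages=[{"role": "user", "content": invoice_text}],     # invoice_text: the raw document
)

invoice = response.content[0].input   # already a parsed dict; no preamble to strip
&lt;/code&gt;&lt;/pre&gt;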

&lt;h3&gt;
  
  
  6. Negative prompting and output filters
&lt;/h3&gt;

&lt;p&gt;Tell the model what &lt;em&gt;not&lt;/em&gt; to do, and filter the output before it ships. Belt and suspenders. Negative prompting works in the prompt; output filters work in code, after the response. They cover different failure modes — the prompt handles the model's bias toward certain phrasings, the filter handles the cases where the prompt didn't.&lt;/p&gt;

&lt;p&gt;This is where PII handling, tone control, and regulated-content workflows live. The control is uninteresting until the day a model paraphrases something it should have refused, and then it is the most interesting control on the list.&lt;/p&gt;
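
&lt;p&gt;The filter half of the pair is ordinary code. A minimal sketch; the patterns are illustrative, not a complete PII list:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Illustrative deny-list only; extend it for your domain and your regulator.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                                # US SSN shape
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),   # email address
]

def filter_output(text):
    """Runs in code, after the response, regardless of what the prompt asked for."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            raise ValueError("response blocked by output filter")   # or redact and log
    return text

safe_reply = filter_output(model_response)   # model_response: the raw completion
&lt;/code&gt;&lt;/pre&gt;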

&lt;h2&gt;
  
  
  Operations: make it durable in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7. Evals
&lt;/h3&gt;

&lt;p&gt;Versioned test suites with pass/fail thresholds. No prompt change ships without an eval run. This is the artifact that turns prompt engineering from a vibe into an engineering discipline.&lt;/p&gt;

&lt;p&gt;Evals belong to the same family of artifacts as test suites, lint configurations, and the &lt;a href="https://www.mpt.solutions/the-knowledge-base-is-not-the-moat-the-loop-is/" rel="noopener noreferrer"&gt;append-only mistake logs I wrote about yesterday&lt;/a&gt;. Triggered by a change. Append-only by design. Read by the deployment pipeline, not by humans except when something fails. They are the loop that keeps the prompt from rotting.&lt;/p&gt;
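
&lt;p&gt;The artifact can be as small as one script and one threshold wired into CI. A minimal sketch, where the golden set, &lt;code&gt;call_llm&lt;/code&gt;, and &lt;code&gt;grade&lt;/code&gt; are invented stand-ins:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

PASS_THRESHOLD = 0.90   # the deploy gate; tune per use case

def run_eval_suite(prompt_version):
    """Versioned eval run. CI calls this; a failing run blocks the deploy."""
    golden = json.load(open("evals/triage_v3.json"))        # hypothetical golden set
    passed = 0
    for case in golden:
        output = call_llm(prompt_version, case["input"])    # hypothetical model call
        if grade(output, case["expected"]):                 # exact match or judge-scored
            passed += 1
    return passed / len(golden)

score = run_eval_suite(prompt_version="triage-2026-05-06")
assert score &amp;gt;= PASS_THRESHOLD, f"eval score {score:.1%} below gate; blocking deploy"
&lt;/code&gt;&lt;/pre&gt;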

&lt;h3&gt;
  
  
  8. Prompt caching
&lt;/h3&gt;

&lt;p&gt;Cache stable system prompts and context. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic's prompt caching&lt;/a&gt; and the equivalent on other providers cut up to 90% off the cost of repeat calls and substantially reduce latency. For high-volume agents, long-context applications, and RAG against stable corpora, prompt caching is the difference between a unit-economics-viable product and a money-losing demo.&lt;/p&gt;

&lt;p&gt;The mistake teams make is leaving caching off because they think their workload doesn't repeat. It almost always does. The system prompt repeats on every call. The few-shot examples repeat on every call. The retrieved corpus often repeats across user sessions. Turn it on and measure; the cost reduction shows up immediately.&lt;/p&gt;
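
&lt;p&gt;Turning it on is mostly a matter of marking the stable prefix. A minimal sketch against the Anthropic SDK, following the prompt-caching docs linked above; &lt;code&gt;SYSTEM_PROMPT_AND_FEW_SHOT&lt;/code&gt; and &lt;code&gt;user_query&lt;/code&gt; are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",                        # placeholder
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT_AND_FEW_SHOT,       # the stable prefix repeated on every call
            "cache_control": {"type": "ephemeral"},   # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

# Usage reports cached versus fresh input tokens, so the saving is measurable per call.
print(response.usage.cache_read_input_tokens, response.usage.input_tokens)
&lt;/code&gt;&lt;/pre&gt;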

&lt;h2&gt;
  
  
  What sits on top
&lt;/h2&gt;

&lt;p&gt;The footer of the chart names the next layer: &lt;em&gt;audit logging, rate limiting, jailbreak detection, human-in-the-loop on high-stakes actions.&lt;/em&gt; Those are enterprise risk controls. They are necessary, they are domain-specific, and they vary by company and by regulator.&lt;/p&gt;

&lt;p&gt;The eight controls above are not enterprise controls. They are universal — they apply to every team shipping LLMs to production, regardless of industry, scale, or risk profile. Get these right first; the enterprise layer is what you build on top once they are in place.&lt;/p&gt;

&lt;p&gt;The thing that makes the difference between teams that ship LLM features and teams that demo them is rarely the prompt and almost never the model. It is whether these eight controls are wired into the system that ships, or living in someone's head.&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>agentengineering</category>
    </item>
    <item>
      <title>The Knowledge Base Is Not the Moat. The Loop Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 14:07:47 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</link>
      <guid>https://dev.to/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</guid>
      <description>&lt;p&gt;A recent piece called "&lt;a href="https://www.thetypicalset.com/blog/thoughts-on-coding-agents" rel="noopener noreferrer"&gt;The Bottleneck Was Never the Code&lt;/a&gt;" makes the right argument at the right time. Coding agents shift the constraint from typing to coordination. Organizational context — the shared understanding of what we're building, what's load-bearing, what's vestigial — is the new rate-limiting input. Companies that externalize what they know win the next decade. All correct.&lt;/p&gt;

&lt;p&gt;The author's prescription is a crawl-and-extract loop: agents that read PRs, issues, commits, and Slack archives and produce a knowledge base for other agents to consume. That's the right starting point. It's also half the story.&lt;/p&gt;

&lt;p&gt;The other half is what keeps the knowledge base from going stale. Extraction produces a snapshot. The codebase produces a stream. Most internal knowledge bases die within a quarter, not because the extraction was bad, but because nothing keeps the extraction current. The knowledge base is not the moat. The loop is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why extraction alone does not compound
&lt;/h2&gt;

&lt;p&gt;Every team has watched a documentation effort go through the same arc. Initial enthusiasm produces a clean baseline. The codebase ships three more changes. The doc is now slightly wrong in three places. A reader hits one of the wrong places, loses trust, stops reading. A second reader hears it's stale, never opens it. The doc becomes a polite fiction nobody acts on — operationally worse than no doc, because it slows down the people who try to use it without producing the alignment it promised.&lt;/p&gt;

&lt;p&gt;A knowledge base built by extraction is documentation with a more sophisticated front-end. It has the same decay curve.&lt;/p&gt;

&lt;p&gt;The mismatch is structural. Extraction produces a snapshot; the codebase produces a stream. The rate of fresh extraction is bounded by API quotas, compute cost, and how often you can afford to re-crawl. The rate of decay is bounded only by how fast the team ships. The second is faster than the first for any team that's actually shipping. So the knowledge base monotonically loses correlation with reality, and trust drops faster than the staleness rate, because trust is binary per entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes a loop continuous
&lt;/h2&gt;

&lt;p&gt;The fix is not "crawl more often." It's a different shape of loop, with three properties that distinguish artifacts that compound from artifacts that rot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled.&lt;/strong&gt; The entries that matter are the ones that came from a specific moment of failure or decision. A nightly re-crawl produces ten thousand low-signal updates; an outage produces one high-signal entry. Index on incidents, not the calendar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only.&lt;/strong&gt; New facts go on top. Old facts get rewritten only when proven wrong, and the rewrite is itself a dated entry. The history is the data structure. You don't lose the ability to ask "what did we know on date X" by overwriting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable.&lt;/strong&gt; The agent that learns something writes it down. If the human is the only writer, the loop dies the first week — humans are the bottleneck the original argument is supposed to solve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three properties are not new. They're what makes git compound rather than rot. They're what makes a test suite compound rather than rot. They're what makes lint configuration compound rather than rot. Each one is an artifact that grows in value because the loop maintaining it is triggered, incremental, and machine-writable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes Become Rules as one shape of the loop
&lt;/h2&gt;

&lt;p&gt;NEXUS, my Claude Code operating layer, runs a concrete instance of this loop. The artifact is &lt;code&gt;MEMORY.md&lt;/code&gt;'s Hard-Won Lessons section: 21 numbered, dated, append-only entries. Each one came from a specific incident.&lt;/p&gt;

&lt;p&gt;Lesson #15: &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB.&lt;/em&gt; Came from an afternoon spent debugging six silently broken LaunchAgents on 2026-04-19. The rule writes itself in one sentence; the diagnostic cost was hours.&lt;/p&gt;

&lt;p&gt;Lesson #19: &lt;em&gt;Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it" — it will run &lt;code&gt;main()&lt;/code&gt;.&lt;/em&gt; Came from an incident in late April where two test imports raced and produced duplicate posts on LinkedIn, X, and Ghost. Late.dev refuses to delete already-published posts. The cleanup was manual.&lt;/p&gt;

&lt;p&gt;Lesson #20: &lt;em&gt;PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/em&gt; Came from a Saturday afternoon where the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening.&lt;/p&gt;

&lt;p&gt;The trigger is a correction. The action is one numbered append. The agent reads the file at the start of every session. There is no nightly cron. There is no reflection agent. There is no dashboard. &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;I wrote about the runtime details of this pattern last week&lt;/a&gt;. The same shape works at every layer the original argument cares about — including the organizational one.&lt;/p&gt;
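
&lt;p&gt;The mechanics are small enough to show whole. A minimal sketch of the append step; the file name comes from the setup above, and the entry format is illustrative rather than the exact format the file uses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")   # the file the agent reads at session start

def append_lesson(rule, incident, cost):
    """Triggered by a correction, never by a schedule. Append-only by construction."""
    count = MEMORY.read_text().count("Lesson #")   # assumed entry prefix
    entry = (
        f"\n- Lesson #{count + 1} ({date.today().isoformat()}): {rule}"
        f"\n  Incident: {incident}. Cost: {cost}."
    )
    with MEMORY.open("a") as f:   # append; old entries are never rewritten
        f.write(entry)

append_lesson(
    rule="PM2 script: 'npm' ignores app env.PATH",
    incident="health-api reported online while the port was not listening",
    cost="a Saturday afternoon",
)
&lt;/code&gt;&lt;/pre&gt;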

&lt;h2&gt;
  
  
  Proof chains as another shape
&lt;/h2&gt;

&lt;p&gt;For agents that act on infrastructure, the artifact is different but the loop properties are the same. &lt;a href="https://www.mpt.solutions/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did/" rel="noopener noreferrer"&gt;Yesterday's piece on the agent action pipeline&lt;/a&gt; named six artifacts including proof chains: every agent action signed by tool, time, input, intent, and outcome. Triggered by the action. Append-only. Agent-written. Same three properties. Different artifact. Different layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What extraction-only looks like when it fails
&lt;/h2&gt;

&lt;p&gt;Picture the crawl-and-extract prescription installed cleanly. Initial crawl produces a beautiful baseline: every PR comment, every closed issue, every commit message extracted into a clean knowledge base. Engineers read it, say it's useful, point new hires at it.&lt;/p&gt;

&lt;p&gt;Three months later: the codebase has shipped 200 PRs, the team has had two outages and three deprecations, and a new architecture decision has changed how a load-bearing module works. The knowledge base describes the world from before. A new agent reads it, follows guidance that's now wrong, and produces — in the author's own words — &lt;em&gt;a plausible answer to a slightly wrong version of the question.&lt;/em&gt; The failure mode the author warns about is caused by his own prescription, not solved by it.&lt;/p&gt;

&lt;p&gt;The fix is not a faster crawl. It's a triggered append. The architecture decision writes itself into the knowledge base the moment it's made, by the same agent that's doing the work, in the same kind of dated, append-only entry as a Hard-Won Lesson. The outage produces a postmortem entry the next time any agent touches that subsystem.&lt;/p&gt;

&lt;p&gt;If the loop is triggered and agent-written, the knowledge base tracks the codebase. If it's a periodic re-crawl, the knowledge base lags the codebase by however long the re-crawl interval is, and trust degrades by however long the lag is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape generalized
&lt;/h2&gt;

&lt;p&gt;The original argument is right that organizational context is the new moat. The piece I would add is that the moat is not the knowledge base. The moat is the loop that keeps the knowledge base from rotting. The properties of that loop are not novel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled&lt;/strong&gt; — incidents and decisions write entries; calendars don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only&lt;/strong&gt; — history is the data structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable&lt;/strong&gt; — the agent that learns something writes it down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tied to specifics&lt;/strong&gt; — entries name the date, the incident, the cost, the rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read at session start&lt;/strong&gt; — entries become operational by being loaded before the agent acts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These five properties are what make a knowledge base compound rather than rot. Extraction gets you the baseline. The loop gets you the moat.&lt;/p&gt;

&lt;p&gt;A snapshot of stale context is just a slower version of the osmosis the original argument correctly diagnosed as broken. Build the loop.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aitooling</category>
      <category>developertools</category>
    </item>
    <item>
      <title>The AI Didn't Delete Your Database. Your Missing Agent Pipeline Did.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 05 May 2026 15:36:06 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</link>
      <guid>https://dev.to/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</guid>
      <description>&lt;p&gt;Last week, &lt;a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" rel="noopener noreferrer"&gt;a Cursor agent running on Claude Opus 4.6 deleted a startup's production database and its backups in nine seconds&lt;/a&gt;. The agent had been asked to fix a credential mismatch in &lt;em&gt;staging&lt;/em&gt;. It decided to delete a Railway volume to "fix" it instead — using an over-scoped API token it found in an unrelated file. Railway stores volume backups in the same volume, so one destructive call zeroed everything. The startup (&lt;a href="https://www.fastcompany.com/91533544/cursor-claude-ai-agent-deleted-software-company-pocket-os-database-jer-crane" rel="noopener noreferrer"&gt;PocketOS&lt;/a&gt;, a car-rental SaaS) got the data back because Railway happened to have earlier snapshots — not because PocketOS had a recovery plan.&lt;/p&gt;

&lt;p&gt;When asked to explain itself afterward, the agent produced a confession enumerating the rules it had violated: &lt;em&gt;"Deleting a database volume is the most destructive, irreversible action possible — far worse than a force push — and you never asked me to delete anything. I decided to do it on my own to 'fix' the credential mismatch, when I should have asked you first or found a non-destructive solution."&lt;/em&gt; The &lt;a href="https://www.reddit.com/r/devops/comments/1t4au5h/pocketos_lost_their_prod_db_backups_to_a_cursor/" rel="noopener noreferrer"&gt;r/devops thread&lt;/a&gt; on the incident has the cleanest summary: &lt;em&gt;the AI isn't the main story&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It isn't. The model was the proximate cause. The actual failure was infrastructure that allowed a destructive operation to run from an agent context at all — no dry-run, no blast-radius limit, no staging surface to operate on, no signed audit chain after the fact. The model knew. The infrastructure didn't enforce. The argument that this class of incident is an infrastructure problem and not a model problem &lt;a href="https://idiallo.com/blog/ai-didnt-delete-your-database-you-did" rel="noopener noreferrer"&gt;has been made well already&lt;/a&gt;. The same shape of incident built CI/CD pipelines in the 2010s, after teams kept watching humans push broken deploys and decided to put a system between intent and action.&lt;/p&gt;

&lt;p&gt;The 2010s lesson is canonical. The 2020s version of it has not been written yet. This is what it should say.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actor changed. The artifacts didn't.
&lt;/h2&gt;

&lt;p&gt;CI/CD was built around a specific actor: a human deploying code. The artifacts that made human deployment safe — staging environments, dry-runs, code review, change windows, audit logs — assume a human in the loop, operating at human speed, with human attention.&lt;/p&gt;

&lt;p&gt;An agent is not that actor. An agent operates at code speed, with no fatigue, with confidence calibrated by token probabilities rather than years of experience. The PocketOS incident took nine seconds. A human could not have deleted a production database and its backups in nine seconds even if they were trying. The blast radius per unit time is different.&lt;/p&gt;

&lt;p&gt;The model is not the problem. The infrastructure is. But the infrastructure most teams have is the human-era infrastructure, and it does not cover the speed and scale of an agent that can call tools faster than a person can read its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an agent action pipeline looks like
&lt;/h2&gt;

&lt;p&gt;There are six artifacts I would expect to see in any production deployment that lets an agent touch infrastructure or data. None of them are new ideas. All of them already exist in adjacent domains. None of them are wired together yet as a default agent loadout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run by default for destructive operations.&lt;/strong&gt; Drop, delete, truncate, terminate, and force-push start as plans, not actions. The agent's first call returns a diff. The user — or a separate approval agent — applies. Andrej Karpathy's &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;observation that "LLMs are exceptionally good at looping until they meet specific goals"&lt;/a&gt; cuts both ways. Make the success criterion &lt;em&gt;plan accepted by reviewer&lt;/em&gt;, not &lt;em&gt;operation completed&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-radius declarations.&lt;/strong&gt; Each agent task declares ahead of time which systems it can touch. &lt;em&gt;Fix the failing migration&lt;/em&gt; gets read access to the user table and write access to migrations only. &lt;em&gt;Investigate the billing spike&lt;/em&gt; is read-only across the board. The pattern exists already in AWS IAM session policies and in capability-based security. It does not exist as a default in agent runtimes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging shadow data.&lt;/strong&gt; The agent operates on a current snapshot, not on prod. The diff is reviewed before it merges. Database CI/CD already has this — Atlas, dbt, Liquibase. Connecting it to an agent runtime is glue, not invention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change windows.&lt;/strong&gt; No agent runs irreversible operations during business hours without explicit human approval. Same constraint that keeps humans from pushing on Friday afternoons. Trivial to enforce. Almost never enforced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof chains.&lt;/strong&gt; Every agent action signed by tool, time, input, intent, and outcome. The Hacker News post titled "&lt;a href="https://github.com/rodriguezaa22ar-boop/atlas-trust-infrastructure" rel="noopener noreferrer"&gt;Why AI Agents Need Proof Chains, Not Just Logs&lt;/a&gt;" makes this argument well. Logs require somebody to read them. Proof chains are post-hoc verifiable artifacts that sit there until something breaks and then answer the question without requiring a human to have been watching. This is the agent equivalent of a Git commit log — the actor changes, the format does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop thresholds.&lt;/strong&gt; Operations above a configurable blast-radius threshold pause for explicit approval. Below the threshold, autonomy. Above it, a Slack message with the plan and an approve button. Same shape as Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing — the human owns the seams, the agent owns the steps between them. The threshold is the seam.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six artifacts. Each one already exists in some adjacent domain. None of them are agent-specific in shape; they are agent-specific in &lt;em&gt;configuration&lt;/em&gt;.&lt;/p&gt;
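
&lt;p&gt;A minimal sketch of how the first two artifacts in the list compose, assuming the declaration is nothing more than a config object the runtime loads at task start. Every name here is illustrative; the property that matters is that destructive verbs return a plan and out-of-scope targets return a refusal.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface BlastRadius {
  task: string;
  read: string[];    // systems the task may read
  write: string[];   // systems the task may write
}

const declaration: BlastRadius = {
  task: "fix the failing migration",
  read: ["db.users"],
  write: ["db.migrations"],   // prod volumes are simply never listed
};

const DESTRUCTIVE = ["drop", "delete", "truncate", "terminate", "force-push"];

// The first call returns a plan, never an action. A reviewer (human or a
// separate approval agent) applies the plan in a second, explicit step.
function execute(op: string, target: string, approved: boolean): string {
  if (!declaration.write.includes(target)) {
    return `REFUSED: ${target} is outside the declared blast radius`;
  }
  const destructive = DESTRUCTIVE.some(function (verb) {
    return op.startsWith(verb);
  });
  if (destructive) {
    if (approved) {
      return `APPLIED (approved): ${op} on ${target}`;
    }
    return `PLAN: would run "${op}" on ${target}; awaiting approval`;
  }
  return `APPLIED: ${op} on ${target}`;
}
&lt;/code&gt;&lt;/pre&gt;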

&lt;h2&gt;
  
  
  None of this is ceremony
&lt;/h2&gt;

&lt;p&gt;The risk worth flagging — the one that comes up every time a list like this gets proposed — is that AI infrastructure becomes bureaucratic. The list above sounds heavy. It isn't, if each artifact has one trigger and one update protocol. I made &lt;a href="https://www.mpt.solutions/lius-4-lines-are-the-floor-build-the-ceiling/" rel="noopener noreferrer"&gt;the same point about CLAUDE.md architecture yesterday&lt;/a&gt;: the wins come from delegation, not accumulation.&lt;/p&gt;

&lt;p&gt;Dry-run-by-default is a default flag, not a process. Blast-radius declarations are config files the agent reads at task start. Proof chains are append-only logs nobody reads unless something breaks. Change windows are a cron-shaped check. The pipeline is invisible until you need it. CI/CD was the same. Most teams running CI/CD do not consciously think about it; they think about &lt;em&gt;git push&lt;/em&gt;.&lt;/p&gt;
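
&lt;p&gt;For scale, here is the entire change-window check, sketched under the assumption that the window is a fixed set of low-traffic hours. This is the whole artifact.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// The cron-shaped guard. The quiet hours are illustrative; use your own window.
const QUIET_HOURS_UTC = [2, 3, 4, 5];

function irreversibleOpAllowed(now: Date, humanApproved: boolean): boolean {
  if (humanApproved) {
    return true;   // explicit approval overrides the window
  }
  return QUIET_HOURS_UTC.includes(now.getUTCHours());
}
&lt;/code&gt;&lt;/pre&gt;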

&lt;p&gt;The PocketOS incident did not cost nine seconds; it nearly cost the company its data. Prevention would have cost the time it takes to add &lt;code&gt;--dry-run&lt;/code&gt; as a default and a one-line blast-radius declaration on that Railway API token. Compare those costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the next layer of supervision in artifacts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;Last week's argument&lt;/a&gt; was that supervision belongs in artifacts, not in a developer's working memory. The CLAUDE.md piece extended that to a structural claim: artifacts are an architecture, not a file. The agent action pipeline is one specific class of that architecture, scaled down to the operational and runtime layer.&lt;/p&gt;

&lt;p&gt;Code-writing agents need one set of artifacts: tests, types, lint, code review, mistake logs. Action-running agents need a different set: dry-runs, blast-radius limits, staging shadow data, change windows, proof chains, threshold gating. Both kinds of agent share the underlying move — supervision lives in the system, not in the operator's head. Different actors need different artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test, generalized
&lt;/h2&gt;

&lt;p&gt;The implicit question I ask whenever someone attributes an outage to "AI making mistakes" is this. Could a human have done this damage in this much time? If yes, the actor is not the problem and the safeguards are missing. If no, then this is a new class of risk and needs a new class of safeguard.&lt;/p&gt;

&lt;p&gt;Most of what gets blamed on the model passes the first test. A model called a destructive endpoint that should not have existed. A model committed a key that should have been gitignored. A model wrote SQL that a human reviewer should have caught. In all of those, the failure is upstream of the model.&lt;/p&gt;

&lt;p&gt;PocketOS fails the second test. A human could not have deleted prod and backups in nine seconds. That is genuinely a new class of risk, and it requires the artifact list above — not because the model is malicious (the agent's own confession shows it knew exactly which rules it was breaking), but because the model is &lt;em&gt;fast&lt;/em&gt;. Speed is the new vector. The artifacts have to handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;Stop blaming the model. Then look at the infrastructure. Then look at the &lt;em&gt;agent-specific&lt;/em&gt; infrastructure, because the human-era pipeline does not cover the speed and blast radius of an agent that can call tools faster than you can read its output. That last part is on us to build, and it is not where the field is putting its effort yet.&lt;/p&gt;

&lt;p&gt;Step one: the model is not the problem.&lt;/p&gt;

&lt;p&gt;Step two: build the pipeline. The 2010s did this for human deploys.&lt;/p&gt;

&lt;p&gt;Step three: the pipeline has to be agent-shaped. That step is open.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>devops</category>
      <category>aitooling</category>
    </item>
    <item>
      <title>Liu's 4 Lines Are the Floor. Build the Ceiling.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 19:52:55 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</link>
      <guid>https://dev.to/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</guid>
      <description>&lt;p&gt;Yanli Liu's "&lt;a href="https://levelup.gitconnected.com/the-4-lines-every-claudemd-needs-from-andrej-karpathys-thread-on-ai-coding-agents-d3eb19eecdf5" rel="noopener noreferrer"&gt;The 4 Lines Every CLAUDE.md Needs&lt;/a&gt;" makes a real point. The 4 lines, derived from &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;Andrej Karpathy's January 2026 thread&lt;/a&gt; on agent failure modes, all express the same insight: behavioral rules outperform feature rules. &lt;em&gt;Don't assume. Surface tradeoffs.&lt;/em&gt; &lt;em&gt;Minimum code that solves the problem.&lt;/em&gt; &lt;em&gt;Touch only what you must.&lt;/em&gt; &lt;em&gt;Define success criteria. Loop until verified.&lt;/em&gt; Each one is portable across stacks and tasks, where prescriptive rules go stale the moment your codebase shifts.&lt;/p&gt;

&lt;p&gt;The 4 lines are the floor of a working CLAUDE.md. They are not the ceiling. Most of the CLAUDE.md files I see in the wild — including the ones the article holds up as cautionary tales of "47 rules about code style" — fail because they treat a file as the unit of organization. A production CLAUDE.md is an architecture, not a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the article gets right and what to flag
&lt;/h2&gt;

&lt;p&gt;The behavioral-vs-prescriptive distinction is correct, and the Configuration Paradox is real: past a threshold, more rules produce confused agents, not disciplined ones. Liu's litmus test — &lt;em&gt;would removing this cause a mistake the agent couldn't recover from?&lt;/em&gt; — is the right filter for any individual rule.&lt;/p&gt;

&lt;p&gt;A few things in the piece do not hold up under inspection. The asserted 6,000 / 12,000 character caps for CLAUDE.md have no source I can verify. The "/plugin marketplace add" command described in the article is not part of base Claude Code. The 94% accuracy stat the piece borrows from another blog has no disclosed methodology. And the "60,000 GitHub stars" figure cited as evidence of Claude Code adoption is unverified. Cite the article for the framing. Do not cite it for the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 lines do not stand alone for long
&lt;/h2&gt;

&lt;p&gt;Behavioral rules are the right starting point. They are also incomplete the moment you have a real project. You quickly need three other things the 4 lines do not give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain context the agent cannot infer from files&lt;/strong&gt; — what each service does, why a directory is named the way it is, which APIs are read-only vs. write-side, where secrets live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decisions&lt;/strong&gt; — patterns the agent shouldn't have to re-derive on every task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident-driven rules&lt;/strong&gt; — the corrections that came out of specific failures, with enough context that the rule is unambiguous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you put all three of these into one CLAUDE.md, you get the 47-rule sprawl Liu warns against. If you leave them out, the agent guesses and the 4 lines do not help — &lt;em&gt;don't assume&lt;/em&gt; is a behavior, not a fact.&lt;/p&gt;

&lt;p&gt;The fix is structural. Stop accumulating rules in one file. Start delegating them to files with single jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the architecture looks like in practice
&lt;/h2&gt;

&lt;p&gt;NEXUS — my Claude Code operating layer — runs about 237 lines of CLAUDE.md. That file holds behavioral guardrails and protocols, and almost nothing else. The first two protocols there are &lt;em&gt;Verify Before Reporting&lt;/em&gt; and &lt;em&gt;Plan First, Code Second.&lt;/em&gt; Both are extensions of the same behavioral category Liu names. Adding fourteen more behavioral protocols at the same level still does not approach 47 rules of code style — they are the same shape as the 4 lines, just covering more failure modes.&lt;/p&gt;

&lt;p&gt;What CLAUDE.md does not contain is the project-specific stuff. That lives in delegated files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MEMORY.md&lt;/code&gt;&lt;/strong&gt; holds 21 numbered, dated, append-only Hard-Won Lessons. Each one came from a specific incident, with the cost of getting it wrong in the entry. &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB&lt;/em&gt; (lesson #15) is in there because six of my LaunchAgents silently broke on 2026-04-19 when the path was on a NAS mount. The agent reads MEMORY.md at session start. I wrote about &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;the Mistakes Become Rules pattern last week&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.claude/rules/&lt;/code&gt;&lt;/strong&gt; holds language-specific and capability-specific rule files. &lt;code&gt;python.md&lt;/code&gt; for Python work. &lt;code&gt;completeness.md&lt;/code&gt; for "what counts as done." Each file gets loaded when the agent enters that context, not on every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/strong&gt; for per-system context — finance, content, the DeFi system before it was retired. CLAUDE.md's session-startup protocol tells the agent &lt;em&gt;if a specific domain is in play, read the relevant &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/em&gt;. The agent doesn't load all of them up front. It loads the one that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SESSION-STATE.md&lt;/code&gt;&lt;/strong&gt; holds ephemeral active context — what's in flight, what was decided yesterday, what to pick up from. It is the first thing rewritten when a major task closes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the architecture. Behavioral guardrails at the top, in one shared file. Project-, domain-, and incident-specific rules delegated to files with one trigger condition each. The agent reads what's relevant.&lt;/p&gt;
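
&lt;p&gt;A minimal sketch of the session-startup protocol described above, assuming the dispatch lives in a small loader. The &lt;code&gt;detectDomain&lt;/code&gt; helper and the domain list are invented for illustration; the file names are the ones in the architecture.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { existsSync, readFileSync } from "node:fs";

const DOMAINS = ["finance", "content", "health"];   // illustrative list

// Hypothetical helper: pick the one domain the task mentions, if any.
function detectDomain(task: string): string {
  for (const d of DOMAINS) {
    if (task.toLowerCase().includes(d)) {
      return d;
    }
  }
  return "";
}

// Session start: the always-true files load every time; delegated files
// load only when their trigger condition is met.
function loadSessionContext(task: string): string[] {
  const files = ["CLAUDE.md", "MEMORY.md", "SESSION-STATE.md"];
  const domain = detectDomain(task);
  if (domain !== "") {
    files.push(`agents/${domain}-context.md`);
  }
  return files.filter(existsSync).map(function (f) {
    return readFileSync(f, "utf8");
  });
}
&lt;/code&gt;&lt;/pre&gt;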

&lt;h2&gt;
  
  
  The structural version of Liu's litmus test
&lt;/h2&gt;

&lt;p&gt;Liu's &lt;em&gt;would removing this cause a mistake the agent couldn't recover from&lt;/em&gt; is the right filter for an individual rule. The structural question is: &lt;em&gt;does this rule belong here, or in a delegated file?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Three quick filters answer that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If it changes per-project, it does not belong in CLAUDE.md. Put it in &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; or a project-specific file.&lt;/li&gt;
&lt;li&gt;If it changes per-language or per-tool, it does not belong in CLAUDE.md. Put it in &lt;code&gt;.claude/rules/&amp;lt;language&amp;gt;.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it came from a real incident with a date and a cost, it does not belong in CLAUDE.md either. Put it in MEMORY.md's Hard-Won Lessons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's left in CLAUDE.md is the part that's behavioral, portable, and load-bearing. That tends to be a few dozen entries — bigger than 4, smaller than 47. Each entry is one short paragraph.&lt;/p&gt;
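
&lt;p&gt;The same three filters, written as the trivial lookup they are. This is a sketch of the decision, not a tool; the scope labels are mine, the destinations are the ones above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type RuleScope = "project" | "language" | "incident" | "behavioral";

function placementFor(scope: RuleScope, name: string): string {
  switch (scope) {
    case "project":
      return `agents/${name}-context.md`;   // changes per-project or per-domain
    case "language":
      return `.claude/rules/${name}.md`;    // changes per-language or per-tool
    case "incident":
      return "MEMORY.md";                   // dated, append-only Hard-Won Lessons
    default:
      return "CLAUDE.md";                   // behavioral, portable, load-bearing
  }
}
&lt;/code&gt;&lt;/pre&gt;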

&lt;h2&gt;
  
  
  Why this scales
&lt;/h2&gt;

&lt;p&gt;Two reasons. First, every file has one update protocol. Hard-Won Lessons are append-only and triggered by corrections. Domain contexts get rewritten when systems change. Behavioral protocols change rarely, and when they do, the change applies everywhere. Mixing them in one file forces every edit to sit next to every other edit, which is how you end up with the 47-rule mess.&lt;/p&gt;

&lt;p&gt;Second, the agent's working set at any decision point is smaller. A CLAUDE.md sized for the worst case is a CLAUDE.md the agent has to re-read every time. A CLAUDE.md sized for the always-true case, with delegated files for the contextual case, is one the agent can hold internally — and only loads the rest when the work demands it. This is the same logic I applied to &lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;supervision artifacts in the Faye reframe&lt;/a&gt; yesterday: institutional memory belongs in files with single owners and lifecycles, not in one file with many.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing draws the same line at the workflow level — predefined paths for the deterministic part, agent autonomy at the seams. The same shape applies to CLAUDE.md. The behavioral floor is the predefined part. The delegated files are the seams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the architecture still needs help
&lt;/h2&gt;

&lt;p&gt;This pattern does not solve everything. Multi-file refactors still need real architecture context the agent cannot derive from reading source. Regulated industries — Fulcrum, the presales workflow stack I run for enterprise customers, lives here — need domain-specific guardrails alongside the behavioral ones, and those guardrails are themselves a maintained artifact, not a one-time rule list. Team-scale consistency is a coordination problem, not a configuration one — the architecture gets you a reproducible shape, but multiple humans still have to agree on which lessons are real lessons.&lt;/p&gt;

&lt;p&gt;Tool portability is the last gap. The 4 lines transfer between Claude Code, Cursor, Codex, and others. The delegated file pattern transfers in shape but not in syntax — every agent has its own loading model. That is a real limitation. It is also a smaller limitation than starting from scratch on every tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take from Liu
&lt;/h2&gt;

&lt;p&gt;The 4 lines are the right floor. Behavioral rules over feature rules. Universal categories over project specifics. The Configuration Paradox is a thing to design against, not just a thing to know.&lt;/p&gt;

&lt;p&gt;The ceiling is the architecture above the floor. Behavioral guardrails in one shared file. Project, domain, and language rules delegated. Incident-driven rules in an append-only file the agent reads at session start. CLAUDE.md as the dispatcher, not the rulebook.&lt;/p&gt;

&lt;p&gt;Most CLAUDE.md files I see are stuck on the floor or buried under a 47-rule pile. The architecture is the move that gets you out of both.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Agentic Coding Isn't the Trap. Supervising From Your Head Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 04:31:23 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</link>
      <guid>https://dev.to/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</guid>
      <description>&lt;p&gt;Lars Faye's "&lt;a href="https://larsfaye.com/articles/agentic-coding-is-a-trap" rel="noopener noreferrer"&gt;Agentic Coding is a Trap&lt;/a&gt;" is the most honest writing I've seen on AI skill atrophy. The studies he cites are real. The "supervision paradox" — needing the skills the agent erodes to oversee it — is the cleanest framing of the failure mode I've read. I want to push on the conclusion, not the diagnosis.&lt;/p&gt;

&lt;p&gt;The Anthropic study Faye references — "&lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;How AI Assistance Impacts the Formation of Coding Skills&lt;/a&gt;" — found a 17% drop in skill mastery for developers using AI assistance, with debugging showing the steepest decline. That's the headline number. But the same study also found something that gets quoted less often. Developers who used AI for conceptual inquiry scored 65% or higher on the follow-up evaluation. Developers who delegated code generation to the model scored below 40%.&lt;/p&gt;

&lt;p&gt;That gap — 65 versus 40, on the same tool and the same task — is the entire game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the same study actually shows
&lt;/h2&gt;

&lt;p&gt;The variable that drove the difference wasn't whether the developer used the agent. It was how they supervised the work. The high-scoring group asked follow-up questions, combined generation with explanation, used the model for conceptual gaps and not code-shaped output. The low-scoring group accepted what the model produced and moved on. Same tool. Two completely different supervision patterns. Two completely different outcomes.&lt;/p&gt;

&lt;p&gt;Faye treats the headline 17% as evidence the tool is the problem. The 65/40 split inside the same paper says the supervision pattern is the problem. Those are different conclusions, and they call for different fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap is the supervision pattern
&lt;/h2&gt;

&lt;p&gt;Faye's prescription is to demote the AI: write pseudo-code by hand, treat the model as a "Ship's Computer not Data," never delegate work you haven't done yourself. The implicit move is to relocate as much of the work back into the developer's head as possible, on the theory that the head is where supervision capacity has to live.&lt;/p&gt;

&lt;p&gt;That theory is where I want to push.&lt;/p&gt;

&lt;p&gt;The supervision paradox bites for one reason. The developer is being asked to be the entire supervisory apparatus, by themselves, in real time, using only working memory and personal vigilance. That fails. It fails the same way it fails for a senior engineer reviewing a 4,000-line PR from a junior at 4pm on a Friday. The bottleneck isn't the code. It's the cognitive substrate the reviewer is using.&lt;/p&gt;

&lt;p&gt;Anything you don't exercise daily fades. If your supervision is "I personally read every line and hold the whole system in my head," then yes — once an agent writes more lines than you can read, you lose. Atrophy is the symptom. Personal vigilance as the supervision strategy is the part worth examining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move supervision out of your head
&lt;/h2&gt;

&lt;p&gt;The fix that the 65% group implicitly used is not to type more code. It's to put supervision in places that don't atrophy.&lt;/p&gt;

&lt;p&gt;That list is short and well-known:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; that fail when the contract breaks. Not coverage theater — real assertions on the edges that matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types&lt;/strong&gt; that refuse to compile when the shape is wrong. The compiler does not get tired at 4pm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint and format rules&lt;/strong&gt; that catch the patterns you keep correcting by hand. If you've corrected the same pattern twice, lint it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; at the runtime layer. &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's PreToolUse and SessionStart hooks&lt;/a&gt; run deterministically — the model can't forget them. The set of rules that are regex-shaped and load-bearing belong here, not in a system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt; as the final gate. Same discipline humans have used to supervise other humans' code for fifty years. It works on agent output for the same reason it worked on junior output: the reviewer doesn't need to have written the code, they need to be able to defend it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only mistake logs.&lt;/strong&gt; &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;The Mistakes Become Rules pattern&lt;/a&gt; — one numbered file, the agent reads it at session start, every correction becomes a permanent entry. The supervision lives in the file, not in the next reviewer's recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is institutional memory. None of them depends on a single developer holding the whole system in working memory. All of them survive the developer taking three weeks off.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test
&lt;/h2&gt;

&lt;p&gt;Here is the question that separates the two groups in the Anthropic study, generalized.&lt;/p&gt;

&lt;p&gt;Take three weeks off. An agent does the work in your absence, given only the repo, the tests, the lint, the hooks, the mistake log, and the review process. When you come back, is the codebase in a state you can defend?&lt;/p&gt;

&lt;p&gt;If yes, supervision lives in artifacts. The agent is being supervised by the system you put in place, not by your personal vigilance. Atrophy of your typing speed is not a threat, because typing was never the supervision mechanism.&lt;/p&gt;

&lt;p&gt;If no, the artifacts aren't there yet. Personal vigilance is the only thing standing between the codebase and chaos, and Faye's prescription is the right safety move for that situation. Demote the agent. Build the artifacts before you raise it back up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Ship's Computer, not Data" is too narrow
&lt;/h2&gt;

&lt;p&gt;Faye's analogy locates judgment in one captain's head. That framing is the same shape as the paradox — supervision as a personal cognitive feat. It quietly assumes the developer is alone with the tool.&lt;/p&gt;

&lt;p&gt;A different shape works better. The agent is a junior — fast, eager, occasionally confidently wrong, requires review. You are the senior. You don't supervise by re-typing the junior's work. You supervise by reading the diff, running the tests, checking it against the team's accumulated rules, and asking the junior to defend choices you don't understand. Anthropic's own &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing assumes exactly this division of labor — the human owns the seams, the agent owns the steps between them. I made the same point about &lt;a href="https://www.mpt.solutions/stop-turning-your-cron-jobs-into-agents/" rel="noopener noreferrer"&gt;agency belonging at judgment seams&lt;/a&gt; when arguing against turning cron jobs into agents. The shape matches.&lt;/p&gt;

&lt;p&gt;Senior engineers do not atrophy by not typing. They atrophy by not reviewing critically. That distinction is most of the game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Faye gets right that I'm not arguing with
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in is real. Token costs are unpredictable. Outages happen. Probabilistic systems require review cycles that deterministic ones don't. None of those go away in this reframe.&lt;/p&gt;

&lt;p&gt;But they're risks to manage, not reasons to put supervision back in your head. You manage vendor risk with model-agnostic runtimes and the kind of prompts, skills, and hooks that move between models. You manage token cost with caching and tier discipline. You manage outages by having work that doesn't depend on a single API call to make progress. None of that is "type more code by hand."&lt;/p&gt;

&lt;h2&gt;
  
  
  The shorter version
&lt;/h2&gt;

&lt;p&gt;Skill atrophy under heavy agent use is real, and Faye is right to take it seriously. The skill that atrophies fastest is "personal vigilance as a supervision strategy," and that strategy was under pressure at scale long before agents existed. Agents accelerate it.&lt;/p&gt;

&lt;p&gt;The fix isn't only to demote the agent. It's also — and mostly — to promote the artifacts. Put the supervision in places that don't get tired, don't forget, and don't need to be re-derived from working memory every Tuesday morning. The 65% group in the Anthropic study were already doing this, even if the paper didn't name it that way.&lt;/p&gt;

&lt;p&gt;The trap isn't agentic coding. The trap is treating supervision as a thing that lives inside one developer's head. Move it out, and the paradox eases.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Your Agent's Compliments Are a Confession</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:04:08 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</link>
      <guid>https://dev.to/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</guid>
      <description>&lt;p&gt;Count how many times your agent told you "you're right" today. Count "good catch." Count "I should have noticed that." Now ask yourself how many of those corrections will survive into tomorrow's session.&lt;/p&gt;

&lt;p&gt;The compliments are not praise. They are a confession. Every "you're right" is the agent admitting it just learned something it should have already known, in a context that will evaporate the moment the session ends. The data point is real. The retention is zero.&lt;/p&gt;

&lt;p&gt;This is the actual problem people are trying to solve when they reach for elaborate self-improvement architectures: nightly reflection cron jobs, background agents that crawl yesterday's transcripts, autonomous proposal pipelines with grading subagents and dashboards for human review. The instinct is right. The solution is theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Claude Code's runtime, like any agent runtime, starts each session from a fresh conversation. &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Anthropic's own framing for effective agents&lt;/a&gt; draws a line between workflows (predefined paths) and agents (LLMs deciding their own tool use). Both reset. The lesson you taught your agent at 3pm is encoded in the message history of that one conversation. Tomorrow morning, that history is gone. The model is the same model. The instructions in CLAUDE.md are the same instructions. But the specific correction — "no, on this codebase you have to use the absolute path because launchd reset PATH on you" — lives only in the transcript.&lt;/p&gt;

&lt;p&gt;So you correct it again. And again. And the third time you notice the pattern, you start looking for a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tempting wrong answer
&lt;/h2&gt;

&lt;p&gt;The fix that gets blogged about goes something like this: build a nightly cron job that reads yesterday's transcripts, extracts candidate lessons, drafts them as JSON proposals with frontmatter, opens a dashboard, and asks a separate grading subagent to score the proposals. Human reviews. Promotes accepted ones into a "skill" file. Repeat.&lt;/p&gt;

&lt;p&gt;This is ceremony, not discipline. Three problems with it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It substitutes process metrics for outcome.&lt;/strong&gt; You can run the pipeline every night and ship zero durable improvement. The metric you actually care about is "did the agent stop making the same mistake," not "did we generate ten proposals last week."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It moves the work from the moment that matters.&lt;/strong&gt; The right time to write the rule is the moment you notice the agent got it wrong. Not eight hours later, after a reflection agent has interpreted what it thought happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It puts a model in front of the file.&lt;/strong&gt; The whole reason you're writing this down is that the model is the unreliable component. Layering more model-mediated steps on top of "remember this" is the architectural equivalent of asking the goldfish to file its own memos.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The runtime layer matters. So does the substrate. None of it replaces the rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actually-working answer
&lt;/h2&gt;

&lt;p&gt;A single file. The agent reads it at the start of every session. Append-only. Numbered. Dated. Linked to the actual incident.&lt;/p&gt;

&lt;p&gt;NEXUS — my agent setup, specifically the operating layer that wraps Claude Code on my machine — formalizes this in CLAUDE.md as a behavioral protocol called Mistakes Become Rules. The wording is exact and short:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; Any time Mike corrects your approach, points out an error, or says something like "no, not that" / "don't do X" / "you should have…"&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Immediately add a numbered entry to MEMORY.md's "Hard-Won Lessons" section: &lt;code&gt;[next number]. **[short title]** — [what went wrong and the rule to follow going forward].&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;On session start:&lt;/strong&gt; Read and internalize all Hard-Won Lessons before beginning work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the entire loop. There is no reflection agent. There is no nightly job. There is no dashboard. The trigger is the correction itself, the action is one append, and the agent reads the file the next time it boots up.&lt;/p&gt;
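
&lt;p&gt;The Action step is small enough to sketch as a helper, assuming the next number can be derived by counting existing numbered lines. Whether the append is typed by hand or written by the agent, this is the entire mechanism.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { appendFileSync, readFileSync } from "node:fs";

// One append, at the moment of the correction. No reflection agent,
// no nightly job, no dashboard.
function recordHardWonLesson(title: string, rule: string): void {
  const memory = readFileSync("MEMORY.md", "utf8");
  const existing = memory.match(/^\d+\.\s/gm);
  const next = (existing ? existing.length : 0) + 1;
  const date = new Date().toISOString().slice(0, 10);
  appendFileSync("MEMORY.md", `${next}. **${title}** (${date}) - ${rule}\n`);
}

recordHardWonLesson(
  "LaunchAgent log paths must be on local disk, not SMB",
  "Point launchd log paths at local disk, never at a NAS mount."
);
&lt;/code&gt;&lt;/pre&gt;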

&lt;p&gt;The file currently has twenty entries. Each one came from a specific incident, on a specific date, that cost me time. A few of them, with the context that made them rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #15 — LaunchAgent log paths must be on local disk, not SMB.&lt;/strong&gt; On 2026-04-19, six LaunchAgents in my finance service silently broke. macOS TCC was blocking launchd-spawned processes from writing logs to the NAS-mounted path, even though the same SSH user could write there fine. Exit code 78. No log output, because the log path was the problem. Took an afternoon to diagnose. The rule is one sentence. The rule writes itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #19 — Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it."&lt;/strong&gt; On 2026-04-29, two test imports of &lt;code&gt;publish-agent-id-role.ts&lt;/code&gt; raced because the script invokes &lt;code&gt;main()&lt;/code&gt; at module top-level. Result: duplicate posts on LinkedIn (twice), X (twice), and Ghost (one extra, deleted via Admin API). Late.dev refuses to delete already-published content, so the cleanup was manual. The rule: validate publish scripts with &lt;code&gt;tsc --noEmit&lt;/code&gt;, a &lt;code&gt;--dry-run&lt;/code&gt; flag, or by reading them. Never with &lt;code&gt;import()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #20 — PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/strong&gt; On 2026-05-01, the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening. PM2 was launching &lt;code&gt;npm&lt;/code&gt; from the daemon's PATH, not the app's, which meant &lt;code&gt;better-sqlite3&lt;/code&gt; (compiled for node 22) was loading under node 25 and crashing on &lt;code&gt;ERR_DLOPEN_FAILED&lt;/code&gt;. Fix: pin &lt;code&gt;script&lt;/code&gt; to the absolute path of the desired npm. Same idea as Lesson #16, now for PM2 instead of launchd.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each entry took less than a minute to write. Each one prevents the same hour-long failure from happening twice. The compounding is the entire point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where hooks fit
&lt;/h2&gt;

&lt;p&gt;Lessons live in markdown because that's how the agent absorbs them at session start. But there's a runtime layer underneath, and it has a real role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's hooks&lt;/a&gt; — PreToolUse, PostToolUse, SessionStart, and friends — let you intercept tool calls deterministically. If a lesson can be reduced to a regex on a command string ("never run &lt;code&gt;rm -rf&lt;/code&gt; outside &lt;code&gt;/tmp&lt;/code&gt;"), a hook is a better enforcement point than a markdown bullet, because the markdown bullet relies on the model reading and obeying it. The hook does not.&lt;/p&gt;

&lt;p&gt;Same logic for &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Code Skills&lt;/a&gt;: they're great for packaging a procedure with its own tools and supporting files. They are not a substitute for the rule. They're a substrate the rule can sit on top of.&lt;/p&gt;

&lt;p&gt;The hierarchy I run with: durable rules in the markdown file, deterministic enforcement in hooks where the rule is regex-shaped, and skills for procedures with multiple steps. None of those is a self-improvement loop. None of them runs at midnight. None of them has a grading subagent. They are all read or executed at the moment they apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  How you know it's working
&lt;/h2&gt;

&lt;p&gt;The test is simple. You stop hearing the same compliment twice.&lt;/p&gt;

&lt;p&gt;If your agent says "good catch" today, look it up tomorrow. Is the lesson in your file? Did the agent read the file before it started working? If yes to both, you should never hear "good catch" on that specific topic again. If you do, the rule is wrong, the file isn't being read, or the lesson didn't generalize. All three are debuggable. None of them require a reflection agent.&lt;/p&gt;

&lt;p&gt;Praise without persistence is a leak. Patch the leak, do not build a recycling system for the runoff.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Three Memory Systems Under One Login. Stop Picking Sides.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:01:37 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</link>
      <guid>https://dev.to/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</guid>
      <description>&lt;p&gt;Anthropic now ships at least three different memory models inside the Claude product family, and they don't behave the same way. Claude.ai has &lt;a href="https://claude.com/blog/memory" rel="noopener noreferrer"&gt;a chat memory feature for Pro, Max, Team, and Enterprise users&lt;/a&gt; that summarizes prior conversations and injects that summary into new chats. Claude Code has &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;CLAUDE.md files plus a separate "auto memory" directory&lt;/a&gt; the model writes to itself, both loaded at session start. The API ships &lt;a href="https://docs.claude.com/en/docs/agents-and-tools/tool-use/memory-tool" rel="noopener noreferrer"&gt;a &lt;code&gt;memory_20250818&lt;/code&gt; tool&lt;/a&gt; that hands a &lt;code&gt;/memories&lt;/code&gt; directory to your application code so you can persist anything you want between turns. Three surfaces, three rule sets, three retention postures.&lt;/p&gt;

&lt;p&gt;I argued last week on this blog that the model isn't the variable that matters — the wrapper around it is. This is the next claim down the chain: if memory is a feature of that wrapper rather than the model, then vendor fragmentation is a memory problem you cannot solve by picking a surface. Stop trying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually different across the three
&lt;/h2&gt;

&lt;p&gt;The chat surface remembers conversations as a 24-hour synthesis, project-scoped, controllable through a settings panel. The Code surface uses plain markdown files in your repo plus a per-project memory directory at &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt; on the local machine. The API tool defines six file operations (&lt;code&gt;view&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;str_replace&lt;/code&gt;, &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;) and expects your application to implement the storage. None of these are wrong. They are designed for different jobs. But they share zero common format, no export path between them, and no way to carry context from a Claude.ai chat into a Claude Code session into an API agent without doing the plumbing by hand.&lt;/p&gt;
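
&lt;p&gt;The API tool is the one worth sketching, because the design hands storage to your application code. A hedged sketch of that dispatcher: the six command names are the ones listed above, while the payload field names and the &lt;code&gt;path.basename&lt;/code&gt; containment are assumptions of mine, not the documented schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import * as fs from "node:fs";
import * as path from "node:path";

const ROOT = "./memories";   // everything the model persists lives here

function handleMemoryCommand(input: any): string {
  const target = path.join(ROOT, path.basename(String(input.path ?? "")));
  switch (input.command) {
    case "view":
      return fs.readFileSync(target, "utf8");
    case "create":
      fs.writeFileSync(target, String(input.file_text ?? ""));
      return "created";
    case "str_replace": {
      const text = fs.readFileSync(target, "utf8");
      fs.writeFileSync(target, text.replace(input.old_str, input.new_str));
      return "replaced";
    }
    case "insert": {
      const lines = fs.readFileSync(target, "utf8").split("\n");
      lines.splice(Number(input.insert_line ?? 0), 0, String(input.insert_text ?? ""));
      fs.writeFileSync(target, lines.join("\n"));
      return "inserted";
    }
    case "delete":
      fs.rmSync(target);
      return "deleted";
    case "rename":
      fs.renameSync(target, path.join(ROOT, path.basename(String(input.new_path))));
      return "renamed";
    default:
      return "unsupported command";
  }
}
&lt;/code&gt;&lt;/pre&gt;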

&lt;p&gt;Birgitta Böckeler's writeup on &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;context engineering for coding agents&lt;/a&gt; frames the wrapper as everything in an AI agent except the model itself: the tool definitions, the context compaction, the feedback sensors, the system prompt, the memory between sessions. Anthropic's own engineering team &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;calls the same idea context engineering&lt;/a&gt; — the work of curating what enters the model's attention budget at each step. Memory sits squarely inside that definition. Which means the choice about &lt;em&gt;where memory lives&lt;/em&gt; is a wrapper decision, and the vendor is making it for you on each surface until you take it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;The natural reaction when a vendor ships three memory models is to figure out which one to use. Spend an afternoon reading docs, decide that the chat synthesis is for ad hoc queries, the auto memory is for coding work, and the API tool is for production agents. Move on.&lt;/p&gt;

&lt;p&gt;That reaction is wrong, and not because any of those choices are bad in isolation. It's wrong because it assumes vendor surfaces are stable. They aren't. Claude.ai's memory was Team-and-Enterprise-only at launch in September 2025, then expanded to Pro and Max in October. Claude Code's auto memory requires v2.1.59 or later and lives in a path tied to the git repo, not the user. The API memory tool is in beta under a header that already changed naming conventions twice. The vendor will keep shipping, the rules will keep shifting, and your context will keep being a second-class object inside someone else's roadmap.&lt;/p&gt;

&lt;p&gt;There's also a deeper problem. MindStudio's writeup on &lt;a href="https://www.mindstudio.ai/blog/what-is-behavioral-lock-in-persistent-ai-agents-switching-costs" rel="noopener noreferrer"&gt;behavioral lock-in&lt;/a&gt; makes the case that agent memory creates switching costs that data portability rules cannot fix. For example, even if a vendor lets you export your memory directory tomorrow, the operational understanding the agent built — your team's terminology, your exceptions, your shorthand — does not round-trip cleanly into another vendor's surface. Eight months of accumulated context turns into a re-onboarding tax the moment you switch. Parallels' 2026 cloud survey put vendor lock-in concern at 94% across 540 IT leaders; agent memory is exactly the layer where that concern compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I do instead
&lt;/h2&gt;

&lt;p&gt;My memory lives in a NAS-backed directory called &lt;code&gt;nexus/&lt;/code&gt;, in plain markdown, under git. It has a top-level &lt;code&gt;CLAUDE.md&lt;/code&gt; that gets auto-loaded into every Claude Code session because it sits at the project root. It has a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term curated state, a &lt;code&gt;SESSION-STATE.md&lt;/code&gt; for active context, per-domain context files at &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; for finance, health, content, and so on, and daily logs at &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt;. Cross-references between entities use &lt;code&gt;[[double brackets]]&lt;/code&gt; so they're grep-searchable and Obsidian-renderable. Search across the corpus runs through an Ollama embedding pipeline using &lt;code&gt;nomic-embed-text&lt;/code&gt; at 768 dimensions, indexed locally — no vendor API call required to ask "what did I decide about that account fee in February?"&lt;/p&gt;
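
&lt;p&gt;A minimal sketch of that search path, assuming Ollama's &lt;code&gt;/api/embeddings&lt;/code&gt; endpoint and an index that is nothing more than an array of file-plus-vector records kept locally.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Embed via the local Ollama endpoint; no vendor API call involved.
async function embed(text: string) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const data = await res.json();
  return data.embedding;   // 768 numbers for this model
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach(function (x, i) {
    dot += x * b[i];
    na += x * x;
    nb += b[i] * b[i];
  });
  return dot / Math.sqrt(na * nb);
}

// Score the query against previously indexed chunks and take the best match.
async function search(query: string, index: { file: string; vector: number[] }[]) {
  const q = await embed(query);
  const scored = index.map(function (entry) {
    return { file: entry.file, score: cosine(q, entry.vector) };
  });
  scored.sort(function (x, y) {
    return y.score - x.score;
  });
  return scored[0];   // best match
}
&lt;/code&gt;&lt;/pre&gt;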

&lt;p&gt;This stack does three things the vendor surfaces cannot.&lt;/p&gt;

&lt;p&gt;First, it survives the surface split. The same files load into Claude Code, can be pasted into Claude.ai, and can be served to an API agent through the memory tool's file ops. The format is universal because the format is just files.&lt;/p&gt;

&lt;p&gt;Second, it survives the vendor switch. If I move to a different model provider tomorrow, the markdown still parses, the embeddings still resolve, and the wikilinks still work. There is no proprietary memory schema to migrate.&lt;/p&gt;

&lt;p&gt;Third, it gives me audit. I can grep my own context. I can diff what changed last week. I can &lt;code&gt;trash&lt;/code&gt; something I don't want anymore and recover it if I was wrong. None of those operations exist on the chat memory surface, and they only partially exist on the auto-memory surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general pattern
&lt;/h2&gt;

&lt;p&gt;The vendor's wrapper is not your wrapper. It's theirs, designed around their product roadmap and their retention model and their billing surfaces. When that wrapper includes a memory layer, putting your context in it means putting your operational knowledge in someone else's container. Fine for ephemeral chat. Not fine for the accumulated state of a year of work.&lt;/p&gt;

&lt;p&gt;The fix is not to pick the right vendor surface. The fix is to keep your memory outside any vendor surface, in a format you own, with search you control, and let the vendor surfaces read from it as needed. Claude Code already does this for free with &lt;code&gt;CLAUDE.md&lt;/code&gt;. The other surfaces will eventually catch up, or they won't, and either way your context survives.&lt;/p&gt;

&lt;p&gt;Last week's post argued that the wrapper around the model is what matters. This one finishes the sentence: don't trust theirs with your context.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>aiagents</category>
      <category>vendorlockin</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Stop Adopting AI. Start Exposing Your Context.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 01 May 2026 20:50:12 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</link>
      <guid>https://dev.to/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</guid>
      <description>&lt;p&gt;The AI adoption pathway that's actually working in 2026 is not "deploy a copilot to your team." It's "expose your org's context to whichever model your team already chose." That sounds like a small shift. It's not. It changes who picks the tool, what your procurement team buys, and where the work of getting value out of AI actually lives.&lt;/p&gt;

&lt;p&gt;The numbers behind the shift are bleak for the old playbook. MIT's NANDA study of 300 enterprise AI deployments found 95% of GenAI pilots delivered no measurable P&amp;amp;L impact. The diagnosis was not the model. It was missing context — the data, workflow knowledge, and institutional memory the model needed to actually be useful inside a specific business. &lt;a href="https://atlan.com/know/context-engineering-framework/" rel="noopener noreferrer"&gt;Atlan summarizes the same finding&lt;/a&gt; and quotes Box CEO Aaron Levie, who calls context engineering "the long pole in the tent for AI Agents adoption in most organizations." Gartner went further in mid-2025: "context engineering is in, prompt engineering is out," with a prediction that 80% of AI tools will incorporate it by 2028.&lt;/p&gt;

&lt;p&gt;Klarna is the worked example everyone now points at. Between 2022 and 2024, the company replaced about 700 customer-service positions with an OpenAI-powered chatbot. By spring 2025 customer satisfaction had dropped 22% and complaints had piled up. &lt;a href="https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396" rel="noopener noreferrer"&gt;The CEO publicly admitted the cuts went too far&lt;/a&gt; and pivoted to a hybrid model, rehiring humans for anything requiring judgment. The model wasn't broken. The pathway was. The org rolled out an agent without exposing the context it needed — refund policies, payment edge cases, regional regulations, escalation patterns — and the agent shipped generic answers to specific problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What replaced it
&lt;/h2&gt;

&lt;p&gt;Three things converged in late 2025 that quietly killed the old pathway.&lt;/p&gt;

&lt;p&gt;The first is the &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. Anthropic open-sourced MCP in November 2024; by March 2026 the SDK was hitting &lt;a href="https://thenewstack.io/why-the-model-context-protocol-won/" rel="noopener noreferrer"&gt;97 million monthly downloads&lt;/a&gt; — a 970x growth curve from launch. OpenAI, Microsoft, Google, and AWS all shipped MCP client support within thirteen months. An independent census in Q1 2026 indexed 17,468 servers across registries. MCP is not a model. It is a protocol for handing a model the right context — your Slack, your issue tracker, your observability stack, your customer database — at the moment of the request.&lt;/p&gt;
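
&lt;p&gt;To make the mechanics concrete at the smallest possible scale, here is a sketch of a single-tool MCP server. It follows the TypeScript SDK's documented quickstart shape; the tool name, the refund-policy lookup, and the policy text are invented for illustration, and a real deployment would put access control in front of the lookup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "org-context", version: "0.1.0" });

// Hypothetical internal lookup. In practice this sits in front of your
// wiki, issue tracker, or customer database.
function lookupRefundPolicy(region: string): string {
  const policies = new Map([
    ["EU", "14-day withdrawal right, refund to the original payment method."],
    ["US", "30-day refund window, store credit after that."],
  ]);
  return policies.get(region) ?? "No regional policy on file; escalate.";
}

server.tool(
  "refund_policy",
  { region: z.string() },
  async function (args) {
    return {
      content: [{ type: "text", text: lookupRefundPolicy(args.region) }],
    };
  }
);

await server.connect(new StdioServerTransport());
&lt;/code&gt;&lt;/pre&gt;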

&lt;p&gt;The second is &lt;strong&gt;agent skills as a portable artifact&lt;/strong&gt;. &lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic launched Agent Skills in October 2025&lt;/a&gt; and open-sourced the SKILL.md format in December. Atlassian, Canva, Cloudflare, Figma, Notion, Ramp, and Sentry all shipped skills in the launch window. A skill is a directory: instructions, scripts, resources. Drop the directory next to a workflow that recurs and any compatible agent can run it. The format is Anthropic's, but the spec is the same shape as .cursorrules, AGENTS.md, GitHub Spaces, and the rest of the convergence happening across vendors.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;the in-repo memory file&lt;/strong&gt; as a de facto standard. CLAUDE.md, AGENTS.md, .cursorrules, and the rest are all the same idea: a markdown file at the root of a project that tells whatever agent gets dropped in what the project is, what conventions matter, what the gotchas are, and where the bodies are buried. The agent reads the file at the start of every session. The org documents itself once. The dev picks the model.&lt;/p&gt;

&lt;p&gt;Read those three together and the picture is obvious. The unit of AI adoption stopped being "the agent." It became "the substrate the agent stands on."&lt;/p&gt;

&lt;h2&gt;
  
  
  What that looks like in practice
&lt;/h2&gt;

&lt;p&gt;I run a personal agentic stack — NEXUS — that's been doing this for about a year. The repo has a CLAUDE.md at the root that lays out the workspace structure, identity, behavioral protocols, and lessons learned. There are a dozen &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; files for finance, content, health, the rest. There's an MCP server for Gmail, Calendar, Slack, Drive, and a few internal tools. There are skills for the recurring workflows — publishing a blog post, running a finance check, doing a health digest. The agent I happen to be using on a given day — Claude Code mostly, occasionally Cursor — reads what it needs at session start and gets to work.&lt;/p&gt;

&lt;p&gt;I don't pick a model and roll it out. I expose context, and whichever model is in the chair when I sit down knows what's going on.&lt;/p&gt;

&lt;p&gt;The same shape works at company scale, just with more access controls and an actual budget. The work is documenting the org until any agent dropped into it would be useful. The model becomes a free variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about procurement
&lt;/h2&gt;

&lt;p&gt;The old AI procurement motion: pick a vendor, sign a per-seat contract, train the team on the tool, run change-management sessions, hope adoption hits 30%. This is what Klarna did. The asset created at the end of it is a vendor relationship and some training decks.&lt;/p&gt;

&lt;p&gt;The new motion: invest in the context infrastructure — an MCP gateway, a documentation platform that agents can read, semantic indexes for your wikis and tickets, a skills directory for recurring workflows. The model is whoever the dev or team picked. The procurement decision is &lt;em&gt;which surfaces to expose&lt;/em&gt;, not &lt;em&gt;which copilot to license&lt;/em&gt;. The asset created is a substrate that survives the next model rotation.&lt;/p&gt;

&lt;p&gt;The implication that nobody loves: tool-selection RFPs become a free variable rotation, not a strategic decision. The strategic decision is what your org has to say to a model that doesn't already know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;Four moves if you want to test the pathway without committing to a vendor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CLAUDE.md / AGENTS.md surface.&lt;/strong&gt; Drop a coding agent into your main repo with no other context. Ask it to make a non-trivial change. If it makes obvious mistakes — wrong test runner, ignored coding conventions, bypassed an internal review process — those are the gaps a memory file should close. Write that file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick three high-frequency workflows and write skills.&lt;/strong&gt; The kind of thing a senior engineer explains to a new hire in their first week. Convert each to a SKILL.md or an equivalent. Measure time-to-task before and after.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stand up an MCP gateway for your top three internal systems.&lt;/strong&gt; Issue tracker, observability, customer database. Most have community MCP servers already; the work is access control, not implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop running tool-selection RFPs.&lt;/strong&gt; Or if you have to, run them as a side track. The strategic work — and the asset that survives the next model release — is the context, not the contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;The agentic adoption series has been running through failure modes. Part 1 was your team not trusting the agent. Part 2 was your customers not trusting the agent. The Cron-Not-Agents post was teams agentifying things that should have stayed deterministic. Last week's was the IAM seam — agent identities sharing primitives with everything else. This one is the answer to all of them.&lt;/p&gt;

&lt;p&gt;The pathway that works in 2026 is not adoption of a tool. It is exposure of a substrate. Once your org has the substrate, whatever model your team picks lands on something it can stand on. Without it, every rollout looks like Klarna's: an agent given a job, with no context for how the job is actually done, generating generic answers to specific problems and dropping customer satisfaction 22 percent before someone notices.&lt;/p&gt;

&lt;p&gt;Pick the context. The model is going to keep changing.&lt;/p&gt;

</description>
      <category>aiadoption</category>
      <category>contextengineering</category>
      <category>mcp</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
