<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paul Scarrone</title>
    <description>The latest articles on DEV Community by Paul Scarrone (@paulscoder).</description>
    <link>https://dev.to/paulscoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F51308%2Fe9373271-b4c6-4806-af0c-d082acbc1ab7.jpeg</url>
      <title>DEV Community: Paul Scarrone</title>
      <link>https://dev.to/paulscoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/paulscoder"/>
    <language>en</language>
    <item>
      <title>kwike - Agent-to-Agent Orchestration in the Unix Philosophy</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:10:38 +0000</pubDate>
      <link>https://dev.to/paulscoder/kwike-agent-to-agent-orchestration-in-the-unix-philosophy-m31</link>
      <guid>https://dev.to/paulscoder/kwike-agent-to-agent-orchestration-in-the-unix-philosophy-m31</guid>
      <description>&lt;p&gt;I've been building kwike, an LLM-first tool for composing agentic workflows using Unix primitives - pipes, append-only logs, and event subscriptions instead of SDKs and harnesses.&lt;/p&gt;

&lt;p&gt;The primary use case here is agentic workflows for specific, repeatable actions, what I often call drudgery: version upgrades, dependency management, or maintaining documentation.&lt;/p&gt;

&lt;p&gt;This isn't going to run your business or make you the next 50MM-in-12hr dude, but it does let you get some boring stuff done while you work on the stuff you like. It's a technical solution for technical problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM First
&lt;/h2&gt;

&lt;p&gt;One of the features kwike focuses on is training an LLM in its use. While it has a CLI that you can use for non-agentic tooling, building a workflow is telling your LLM of choice - in my case, Claude - "You have access to kwike, an agentic workflow orchestration tool. Read &lt;code&gt;kwike --help&lt;/code&gt; and &lt;code&gt;kwike docs&lt;/code&gt;. I want to build a workflow that looks something like X. Let me know when you are ready to discuss." Much of what comes next can be constructed without any experience with the tool. It includes debugging tools and static analysis to validate configuration and the workflow DAG.&lt;/p&gt;

&lt;p&gt;While setting up a new workflow isn't immediate, rather complex behavior can be recorded and committed alongside your projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x5dl84ejkmstfpv9r5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x5dl84ejkmstfpv9r5l.png" alt="Doc Updater"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deep down this is an agent-to-agent communication protocol through event dispatch and subscription. The tool exposes the following primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dispatch (emit events to an event store)&lt;/li&gt;
&lt;li&gt;consume (pull events subscribed to)&lt;/li&gt;
&lt;li&gt;daemon (event store owner, can mesh across network boundaries)&lt;/li&gt;
&lt;/ul&gt;
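A rough sketch of how these three primitives might fit together on the command line; the exact flags here are my assumptions (only --type and --config appear in the example later in this post):

```shell
# terminal 1: start the event store owner
kwike daemon

# terminal 2: emit an event into the store (payload flag is illustrative)
kwike dispatch --type task.implement --payload '{"dir": "./app"}'

# terminal 3: pull subscribed events and hand them to an agent
kwike consume --config ./agents/implementer/consumer.yaml
```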

&lt;p&gt;Each consumer watches for one or more message types, defining its own partition, and maintains its own read cursor. The daemon essentially acts as a mail server, and new messages are dispatched with the intent of being reacted to. Think of it as allowing system events or another LLM's tool use to spawn a dedicated workflow.&lt;/p&gt;

&lt;p&gt;Each workflow step is called a uniform and is assigned to a dedicated consumer process. A uniform is a prompt template passed to the underlying tool the consumer executes. The goal is to force the LLM to consume and produce a consistent contract so agents and scripts can exchange data. The protocol also provides some guarantee that agentic processes complete their work by requiring a JSON schema that validates the agent's output.&lt;/p&gt;
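To make the contract idea concrete, a uniform might pair a prompt template with a JSON Schema for the agent's output. This is only an illustrative sketch; the field names are my invention, not kwike's actual configuration:

```yaml
# hypothetical uniform config: field names are illustrative, not kwike's real schema
uniform:
  prompt: |
    Implement the task described in the event payload.
    Reply with JSON that matches the output schema.
  output_schema:
    type: object
    required: [status, summary]
    properties:
      status:
        enum: [done, failed]
      summary:
        type: string
```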

&lt;p&gt;In many cases this becomes a context exchange between agents: reminders of working directories, or fan-out/fan-in statuses. The exchange is roughly analogous to an email thread. A consumer can subscribe to many sources and, for claude-code, can force a clean context or continue a session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.implement"&lt;/span&gt; &lt;span class="c1"&gt;# subscribed events &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.review.rejected"&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.review.approved"&lt;/span&gt; 

&lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;fresh_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;task.implement&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# these create new sessions &lt;/span&gt;
  &lt;span class="na"&gt;resume_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;task.review.rejected&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# these resume existing sessions &lt;/span&gt;
 &lt;span class="c1"&gt;# task.review.approved is neither → defaults to fresh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool call or agent can map lifecycle events so the output contract can inform on &lt;code&gt;.done&lt;/code&gt; and &lt;code&gt;.failed&lt;/code&gt;. When these mappings exist, a successful tool or agent completion results in a new event dispatched as a threaded reply, allowing other consumers to participate in the work with their own uniform defining a task.&lt;/p&gt;
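As a sketch of what such a mapping could look like (the key names are my invention; only the .done and .failed behavior comes from the description above):

```yaml
# hypothetical lifecycle mapping: key names are illustrative
lifecycle:
  done: task.review       # on success, dispatch task.review as a threaded reply
  failed: task.implement  # on failure, re-dispatch so another consumer can retry
```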

&lt;h2&gt;
  
  
  Claude-Code Caveat
&lt;/h2&gt;

&lt;p&gt;Much of this has been built around claude-code, and there are some limitations with other agents right now, but this is what I have so far. Claude as the tool, for example, supports session resume, which lets retries and replies maintain a conversationally bound session to speed things up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Couple Line Example
&lt;/h2&gt;

&lt;p&gt;Here is an example of &lt;code&gt;watch&lt;/code&gt;, a convenience command that automatically dispatches any command's output on an interval. It requires a running kwike daemon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# watch a git repo, dispatch changes, consume with an agent&lt;/span&gt;
kwike watch &lt;span class="s2"&gt;"git diff --stat HEAD~1"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; repo.change &lt;span class="nt"&gt;--interval&lt;/span&gt; 60s
kwike consume &lt;span class="nt"&gt;--config&lt;/span&gt; ./agents/writer/consumer.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyway, I have plenty of examples in the repo: &lt;a href="https://git.sr.ht/~ninjapanzer/kwike" rel="noopener noreferrer"&gt;https://git.sr.ht/~ninjapanzer/kwike&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is plenty left to do, but I would love to hear some feedback on the architecture and its assumptions: contract constraints as a mode to make LLMs behave, the crash-only approach, and the message durability that mirrors Kafka but acts more like a newsgroup server and mail clients.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>tooling</category>
      <category>automation</category>
    </item>
    <item>
      <title>Automatic Programming: Iteration 4</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/paulscoder/automatic-programming-iteration-4-1ce6</link>
      <guid>https://dev.to/paulscoder/automatic-programming-iteration-4-1ce6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevelopmeh.com%2Fdevex%2Fmark1-computer.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevelopmeh.com%2Fdevex%2Fmark1-computer.jpg" alt="Grace Hopper: Mark 1 Computer" width="1185" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Programming: Iteration 4 🔗
&lt;/h2&gt;

&lt;p&gt;While I know that the discourse over LLM-generated code is not a complete binary of for or against, it's probably the right time to take a step back a few years and explore the iterations of our industry.&lt;/p&gt;

&lt;p&gt;Let's just work backwards: COBOL, iteration 3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="//Synonymous%20with%20evil."&gt;Common Business-Oriented Language&lt;/a&gt; A weak, verbose, and flabby language used by code grinders to do boring mindless things on dinosaur mainframes. Hackers believe that all COBOL programmers are suits or code grinders, and no self-respecting hacker will ever admit to having learned the language. Its very name is seldom uttered without ritual expressions of disgust or horror.&lt;em&gt;Evans, Claire L - Broad band&lt;/em&gt; the untold story of the women who made the Internet_ -&amp;gt; &lt;em&gt;From The Hacker's Dictionary&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes, the perpetual software of big finance and numerous other systems. While I haven't written COBOL myself I have had to rewrite at least one system written in COBOL.&lt;/p&gt;

&lt;p&gt;Poking fun aside, COBOL exists because of the dream that non-experts could be computer programmers. By today's standards you still need to be an expert to write COBOL. Prior to its creation, software was written in assembly; before that, machine code; and before that, patch cables, actual wires: iteration 1.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Grace knew that would only happen when two things occurred:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Users could command their own computers in natural language.&lt;/li&gt;
&lt;li&gt;That language was machine independent. That is to say, when a piece of software could be understood by a programmer as well as by its users, when the same piece of software could run on a UNIVAC as easily as on an IBM machine, code could begin to bend to the wills of the world. Grace called this general idea "automatic programming"... &lt;em&gt;Evans, Claire L., Broad Band: The Untold Story of the Women Who Made the Internet&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;That Grace of course was Grace Hopper, and she was obsessed with making programming easier and more efficient. In her time programming was a kind of wizardry, and very few knew the incantations to make the computer operate. From a business standpoint, making programming easier was bad, since computer companies sold the computer and the software.&lt;/p&gt;

&lt;p&gt;Portable programs, ones that could be written on any machine for any other machine, were a business risk. They created competition, so there was resistance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Those who resisted automatic programming became known as "Neanderthals." They might as well have called themselves framebreakers, as Lord Byron had over a century before. &lt;em&gt;Evans, Claire L., Broad Band: The Untold Story of the Women Who Made the Internet&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Framebreakers" refers to those workers who opposed the automatic loom, better known as the Luddites.&lt;/p&gt;

&lt;p&gt;Before computers, cloth was made on the loom, and the punchcard originated as the mechanism that made the loom automatic. After its invention there was not much use for a manual loom. It was disruptive and changed an entire industry, displacing workers.&lt;/p&gt;

&lt;p&gt;Grace and her cadre believed in a future where the programs write themselves. There is some corollary to today with code generation, which actually is programs writing themselves, something Grace dreamed of. The difference between the loom and the compiler is only in the growth potential. Cloth was the end result of a chain of optimization, and while you could tirelessly create it in any pattern imaginable, someone still needed to imagine the patterns. Variety of cloth became commonplace; the art remained, the drudgery was lost.&lt;/p&gt;

&lt;p&gt;Now I respect that for some the act was the value; I share those feelings. I love writing the actual code; I care about it more than the product it produces. An opinion whose popularity depends on which side of the invoice you sit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A quick lesson: computers do not understand English, French, Mandarin Chinese, or any human language. Only machine code, usually binary, can command a computer, at its most elemental level, to pulse electricity through its interconnected logic gates. &lt;em&gt;Evans, Claire L., Broad Band: The Untold Story of the Women Who Made the Internet&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Programs are essentially the aggregation of basic operations, layers upon layers. If we think of code generation as just another kind of compiler it's part of a long lifeline of change approaching the ideal "automatic programming."&lt;/p&gt;

&lt;p&gt;Of course I would prefer to ask Grace her formal opinion, but consider the argument she presented in her time. Mathematicians were once inundated with the tedium of arithmetic required to solve their equations. Computers arrived, essentially removed the need for those steps, and allowed them to get closer to the interesting part, the solutions. She argued that the compiler did the same thing: modulating the complexity of using computers allowed programmers to spend more time on stimulating thoughts.&lt;/p&gt;

&lt;p&gt;Of course the reality was mathematicians became programmers to advance their work.&lt;/p&gt;

&lt;p&gt;Programmers used compilers to build elegant languages, and COBOL.&lt;/p&gt;

&lt;p&gt;I think it's quite funny to take the perspective that programs are binary, iteration 2, and that writing the program for the computer meant creating something that could accurately generate binary programs.&lt;/p&gt;

&lt;p&gt;Now we stand at another transition where automatic programming is telling the computer to write the code that the compiler turns into a binary program. If the compiler was the 3rd level operation we are now at the 4th.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here is where it all came together though: 🔗
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Grace loved coding, but she admitted that "the novelty of inventing programs wears off and degenerates into the dull labor of writing and checking programs. This duty now looms as an imposition on the human brain." &lt;em&gt;Evans, Claire L., Broad Band: The Untold Story of the Women Who Made the Internet&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have been feeling this for years, the code just keeps getting more repetitive. I just keep doing the same extremely complicated and extremely boring operations over and over again. The novelty of software I grew up with in the 2000s is over. Everything is a framework or a dogma and all solutions are solving the same problems with a different color scheme and font.&lt;/p&gt;

&lt;p&gt;If anything would make me embrace the 4th level it's this, even if it means no one needs me anymore. I can at least see the realization of Grace's dream and in some way if everyone becomes a programmer finally, I'll have more people to talk to about what I love.&lt;/p&gt;

&lt;p&gt;We aren't there yet, the software world is still pretty complicated and you have to know a lot of special dance moves to get things working right. But it's not going to be forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The book 🔗
&lt;/h2&gt;

&lt;p&gt;Before I wander off into a diatribe of where our future is going lemme just stop and tell you to read this book:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.penguinrandomhouse.com/books/545427/broad-band-by-claire-l-evans/" rel="noopener noreferrer"&gt;Broad Band - Claire L. Evans&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a good one; it has changed my mind on whether I will continue to call myself an engineer or a programmer. The definition has finally been clarified. I was today-minus-two-weeks old, and that was too long to go without knowing the truth.&lt;/p&gt;

&lt;p&gt;It also answers the question of why women seem absent from the history of computer science, even though I was taught by women who had careers in computer science.&lt;/p&gt;

&lt;p&gt;Point is: strong recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are we going? 🔗
&lt;/h2&gt;

&lt;p&gt;I dunno, maybe we are all out of work. Maybe our MBA-degree bosses will finally get their reality of the numbers going up and to the right forever.&lt;/p&gt;

&lt;p&gt;I see it like this: programming ended when the job absorbed all its roles by mere definition. It has been an amalgam for a while, and that has been a crime. Now I can focus on building things again at the scale required in our times. I can concurrently build 2 or 3 projects while focusing on my writing. Sounds like a dream, and the troubles of today are not forever.&lt;/p&gt;

&lt;p&gt;The current goal of centralized AI is unsustainable, and within a few years the ASICs will arrive, our computers will be packed with high-bandwidth memory, and the models will be local. Just like how all of a sudden we all started walking around with supercomputers in our pockets, we will build the infrastructure to build all the hardware we need to move forward.&lt;/p&gt;

&lt;p&gt;I mean, we live in the dumbest timeline and greed seems to be winning, but if the pattern from the past is here to loop again, we go from the dumb time to the bright time for a while. I am looking forward to that at least.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>computerscience</category>
      <category>programming</category>
    </item>
    <item>
      <title>Keep Your Eyes on the IDE, and Your Robots on the Tickets</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/paulscoder/keep-your-eyes-on-the-ide-and-your-robots-on-the-tickets-4hac</link>
      <guid>https://dev.to/paulscoder/keep-your-eyes-on-the-ide-and-your-robots-on-the-tickets-4hac</guid>
      <description>&lt;h2&gt;
  
  
  Keep Your Eyes on the IDE, and Your Robots on the Tickets 🔗
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Initial Scene:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Narrator: Bead Manager?! What does that even mean... let's start back at the beginning:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scene Break:&lt;/em&gt; (Dissolve)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Time Jump:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Two weeks earlier..."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;New Scene:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A tall handsome man with thick dark hair leans over a computer with boxes of black bordered in grey scrolling dark green text. Scowling...&lt;/p&gt;

&lt;p&gt;Author enters the room&lt;/p&gt;

&lt;p&gt;Author: Who the hell are you! Get away from my laptop! Freaking coffee shops...&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hero's Journey 🔗
&lt;/h3&gt;

&lt;p&gt;As you can imagine, I have been following the post-transformer LLM growth for about 4-5 years at this point. I didn't understand it and never really used it, but I kept my ear to the ground, increasingly frustrated with the inability to keep the LLM on task. I mean, it's ignorance on my part, and the tool isn't ready yet. Such is the mark of progress: things improve over time. Although I am still challenged by simple things.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give me 20 variations of this prompt as JSONL training data using X format&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I get 8...&lt;/p&gt;

&lt;p&gt;I get 23&lt;/p&gt;

&lt;p&gt;I get 12&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jump Cut:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Laptop launches out the window&lt;/p&gt;

&lt;p&gt;So that's problem one, and how do we solve it? Well, with a novel wrapper that counts outputs and then re-prompts to do it again. I think they call that the &lt;em&gt;Ralph Loop&lt;/em&gt;; I don't, I just call it the nature of the thing.&lt;/p&gt;

&lt;p&gt;I learned later that this is generally caused by ambiguity in the context. Asking for 1 item 20 times, feeding back the previous set to avoid duplicates, always works better. The teaching: the computer is dumb; don't make it think too hard, and everything goes smoother.&lt;/p&gt;
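One way the one-item-at-a-time approach might look with the claude CLI (the --print and -p flags appear later in this post; the prompt wording and file names are mine):

```shell
# build a JSONL training set one item at a time,
# feeding the existing set back in to avoid duplicates
touch variations.jsonl
for i in $(seq 1 20); do
  claude --print -p "Existing variations:
$(cat variations.jsonl)
Give me exactly 1 new variation of the prompt as a single JSONL line." >> variations.jsonl
done
```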

&lt;p&gt;Most of what is to follow is the application of &lt;a href="https://dev.to/soft-wares/agentic-patterns-elements-of-reusable-context-oriented-determinism/"&gt;Agentic Patterns: Elements of Reusable Context-Oriented Determinism&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Beads 🔗
&lt;/h3&gt;

&lt;p&gt;What beads provides is really just an idea, and it's worth exploring yourself: &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;https://github.com/steveyegge/beads&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It self-describes as "A memory upgrade for your coding agent," which I think is arguable, but it was the trigger I needed to expand my concept of what a workflow with an LLM could look like. To be honest I didn't just go "Ah, Beads! It's all clear now." Instead I found this article about &lt;a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" rel="noopener noreferrer"&gt;Gas Town&lt;/a&gt;, which I didn't read, thanks ADHD, and instead installed it blindly. If I were to give it a review, it would be that Gas Town is kind of a meme of agent orchestration. Clearly a lot of work was put into it, but I think the author might agree that it's an expression of an idea in a more artistic than practical form.&lt;/p&gt;

&lt;p&gt;But who cares; I walked back from Gas Town to beads, the underlying magic in my opinion. I describe it as a context graph: manually or with an agent LLM, I am able to extract just as much focused context as I want and use it as a concrete, repeatable prompt. While the same prompt doesn't get the exact same response each time, it gets generally the same tool use, and generally the same code is constructed. Which makes me wonder whether the variability of code is so limited by its grammar restrictions that LLMs have fewer predictive options to bias towards.&lt;/p&gt;

&lt;p&gt;Ok, I am gilding the lily a bit: a bead is just a bug ticket or a todo-list item, and it's a prompt that has a dependency chain. How I am using it is more like Jira for robots, if Jira weren't software designed for my suffering. I am able to build a feature, break down tasks, then feed a path of those tasks to the agent.&lt;/p&gt;

&lt;p&gt;You may be asking, why not just use markdown files or JSONL? Well, because I am a human and I hate reading JSONL files. I have ADHD, so if the file is longer than 10 lines it will never be fully read. Better put what you want as the last line on the bottom, 'cause that's all I see. Point is, I need to be able to monitor, tune, and track the agents. See, what Gas Town did was have the agents self-manage. While novel, it's a bit bizarre when you are trying to avoid scope creep, 'cause LLMs love to add features.&lt;/p&gt;

&lt;p&gt;Back to the other question: why not markdown files? Two reasons: first, they are kinda noisy; second, if the LLM has to read more than the exact section of the file it is working on, ambiguity can be introduced. If you watch, the agent will often scan a file 50 lines at a time if there is no index, which means some of that ends up in its context. When we want determinism, our first goal is to make sure each interaction is exactly the same prompt. This means beads is mostly an opinion and is probably not required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stay in the IDE and Manage your robots 🔗
&lt;/h3&gt;

&lt;p&gt;Good choices after bad, maybe, but once I have a database for my tasks and their prompts, I need a way to visualize it. The purpose is to allow me to create and observe the tasks my agent orchestration is running on. For me this is just Claude Opus delegating tasks to Sonnet agents in an agentic loop.&lt;/p&gt;

&lt;p&gt;This all started with this command &lt;code&gt;bd graph --compact --all&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpk40uc0gtkb2p1cjkuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpk40uc0gtkb2p1cjkuv.png" alt="Beads graph output" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All because I wanted to watch my agent orchestration work through my tickets for another project.&lt;/p&gt;

&lt;p&gt;Well that has led to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnubnfo1havc7fevts46x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnubnfo1havc7fevts46x.gif" alt="Beads Manager plugin demo" width="560" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A full management console that lets me watch the beads transition status, but also lets me edit them and add comments.&lt;/p&gt;

&lt;p&gt;This video demonstrates an experimental refinement mechanism, available in the current release: &lt;a href="https://plugins.jetbrains.com/plugin/30089-beads-manager" rel="noopener noreferrer"&gt;Jetbrains Marketplace&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The workflow 🔗
&lt;/h3&gt;

&lt;p&gt;So the other half of this tool is this set of prompts for claude: &lt;a href="https://github.com/ninjapanzer/beads-orchestration-claude" rel="noopener noreferrer"&gt;beads-orchestration-claude&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, this is for claude, but the practice can be applied manually or with other agents; the pattern is what matters, and the prompt encapsulates the pattern more than the agent does.&lt;/p&gt;

&lt;p&gt;The keys here are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recoverable&lt;/li&gt;
&lt;li&gt;Durable&lt;/li&gt;
&lt;li&gt;Keep your eyes in the IDE&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  1. Planning 🔗
&lt;/h4&gt;

&lt;p&gt;So our first path here is to plan out a feature. This is really the only time we have a discussion with the LLM, and my recommendation is to write a brief in a markdown file. A musing is good enough: describe the problem, some technical planning around constraints, and the systems you want to support.&lt;/p&gt;

&lt;p&gt;Include what you would for any brief: use-cases, goals, non-goals, definitions, and open questions.&lt;/p&gt;

&lt;p&gt;Once this is prepared, you hand it over to the agent. For me that means the &lt;code&gt;/new project&lt;/code&gt; command &lt;a href="https://github.com/ninjapanzer/beads-orchestration-claude?tab=readme-ov-file#new-project-setup" rel="noopener noreferrer"&gt;REF&lt;/a&gt;: if we provide it with &lt;code&gt;project name&lt;/code&gt;, &lt;code&gt;readme&lt;/code&gt;, and &lt;code&gt;git remote url&lt;/code&gt;, it will set up a baseline project with beads, using some LLM magic and a bash script to read the brief and prepare the project with a proper explanation of the project for CLAUDE.&lt;/p&gt;

&lt;p&gt;Once we have a nice agent-specific write-up for the project, which is important, we can begin planning. Beads provides some tools that will naturally be injected into your project to help the agent, but you may need to tell your agent this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;use beads &lt;code&gt;bd&lt;/code&gt; to plan out tasks for this project, &lt;code&gt;bd prime&lt;/code&gt; for an overview of commands&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;bd prime&lt;/code&gt; exposes an agent friendly output for how to invoke commands.&lt;/p&gt;

&lt;p&gt;Your agent should now be creating issues in beads for your project. Depending on how you like it, you can use as many or as few of beads' features as you like; it has a number of fields to hold context about actions. At the very simplest you will get titles and descriptions. If you asked for a feature or an epic, you will find they may have been mapped as dependencies.&lt;/p&gt;

&lt;p&gt;You should then review the tasks. This can be done with &lt;code&gt;bd list&lt;/code&gt; and &lt;code&gt;bd show id&lt;/code&gt; or use the jetbrains plugin.&lt;/p&gt;
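In practice that review loop is just a few of the bd commands mentioned in this post (the bead id is a placeholder):

```shell
bd prime                   # agent-friendly overview of bd commands
bd list                    # review the tasks the agent created
bd show some-id            # inspect one bead's description and dependencies
bd graph --compact --all   # the dependency graph view shown earlier
```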

&lt;h4&gt;
  
  
  2. Review 🔗
&lt;/h4&gt;

&lt;p&gt;So now we review the beads and expand or contract the plan, asking the agent to defer tickets we are unsure about or expand others.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Work Breakdown 🔗
&lt;/h4&gt;

&lt;p&gt;This is probably the most important part. Ask a reasoning model to review all the beads and provide implementation details for those exact tasks in the beads. The idea here is to have the agent make a big plan, but instead of writing all the code, write code snippets that are attached to the tasks.&lt;/p&gt;

&lt;p&gt;We can then take the vibe-code approach and execute on this, or do a pre-review of our code. It's not uncommon for the agent to have wandered down a bad architecture path. Here is our moment to focus on a specific task and a specific ticket and allow things to be revised in a focused way.&lt;/p&gt;

&lt;p&gt;The best way to do this is to first clear your context and ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given the project overview please review bead  and revise it to include a single refresh flow for all data sources. Also review implementation details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  4. SDLC 🔗
&lt;/h4&gt;

&lt;p&gt;Tell the agent to make documentation and testing tasks, linking them as required to the beads they relate to. You should end up with a second layer of tasks that will follow up after the implementation completes.&lt;/p&gt;

&lt;p&gt;I usually then ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given the use-cases in the project overview define an e2e testing ticket for planning e2e tests that we can review at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If all is well, the agent should create a task at which it will stop and design testing with you, including acceptance criteria based on the provided use-cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Burn tokens 🔗
&lt;/h4&gt;

&lt;p&gt;Now we get to the more technical part. We need to delegate actions to sub-agents, and depending on what agent infra you use, this could be built-in or require manual orchestration.&lt;/p&gt;

&lt;p&gt;The command &lt;code&gt;/beads-orchestrate&lt;/code&gt; &lt;a href="https://github.com/ninjapanzer/beads-orchestration-claude/tree/master?tab=readme-ov-file#beads-orchestration" rel="noopener noreferrer"&gt;REF&lt;/a&gt; handles most of the heavy lifting.&lt;/p&gt;

&lt;p&gt;It instructs the orchestrator to fork new processes using a template. For Claude this means it will append flags like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--dangerously-skip-permissions --model sonnet|haiku --print -p "..."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the prompt, it reads the bead details along with some workflow instructions about updating the bead, writes that new prompt to a temp file, and passes the temp file to the new process. This gives you the ability to debug what is happening at the injected-prompt level.&lt;/p&gt;

&lt;p&gt;It then sleeps and waits for the subprocess to finish.&lt;/p&gt;
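&lt;p&gt;A rough Python sketch of that fork-and-wait step, assuming a prompt file named after the bead and an agent command passed in (the function and file naming are mine, not the real orchestrator's internals):&lt;/p&gt;

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def dispatch_subagent(bead_id, prompt, agent_cmd, timeout_s=600):
    """Write the refined prompt to a file named after the bead, fork a fresh
    agent process on it, and block until it exits or times out.  Because the
    prompt file survives a crash, recovery is just calling this again."""
    prompt_file = Path(tempfile.gettempdir()) / f"{bead_id}.prompt.md"
    prompt_file.write_text(prompt)
    proc = subprocess.run(
        agent_cmd + [str(prompt_file)],  # agent reads its prompt from the file
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.returncode

# Stand-in "agent" for demonstration: echoes its prompt file back.
echo_agent = [sys.executable, "-c",
              "import sys; print(open(sys.argv[1]).read())"]
```

&lt;p&gt;The point of the indirection is durability: the orchestrator's state lives in files and tickets, not in its own context window.&lt;/p&gt;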

&lt;p&gt;&lt;em&gt;Why?&lt;/em&gt; Well, Claude is just a Node.js app, and it eventually runs out of heap space because it reads the stdout and stderr of every task it orchestrates internally. A subprocess it watches is a fresh Claude instance, so if it fails, it fails in a recoverable way. Since the prompt file is named after the bead, it can recover by just restarting the agent.&lt;/p&gt;

&lt;p&gt;At this point the orchestrator should spawn the implementer which reads the implementation details and completes the work.&lt;/p&gt;

&lt;p&gt;Then the orchestrator will spawn a new agent to handle code review, usually a simpler agent.&lt;/p&gt;

&lt;p&gt;All this time the agents leave comments on the tickets, so you can see where they ran into problems and picked a tradeoff. This is very important for root cause analysis later if a feature degrades. You can have the LLM resurrect the changes merged into a branch with the same name as the bead, review the decision it made, and make another one. Better yet, since the orchestrator and implementer read the comments, you can just append your request to the ticket, mark it open, and tell the agent to refactor it again.&lt;/p&gt;

&lt;p&gt;Here is an example of what a bead comment thread looks like in practice. This is from an issue where the Gradle build was failing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;[Opus]&lt;/strong&gt; at 2026-02-06 01:11 Starting orchestration. Dispatching Sonnet implementer to fix Kotlin stdlib warning in gradle buildPlugin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Opus]&lt;/strong&gt; at 2026-02-06 01:16 Resuming orchestration. Dispatching Sonnet implementer to fix Kotlin stdlib warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Opus]&lt;/strong&gt; at 2026-02-06 01:24 Resuming orchestration. Previous worktree had no commits - starting fresh. Dispatching Sonnet implementer to fix Kotlin stdlib warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Sonnet]&lt;/strong&gt; at 2026-02-06 01:25 Starting implementation. Will examine build.gradle.kts and gradle.properties to understand current configuration, then apply fix per &lt;a href="https://jb.gg/intellij-platform-kotlin-stdlib" rel="noopener noreferrer"&gt;https://jb.gg/intellij-platform-kotlin-stdlib&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Sonnet]&lt;/strong&gt; at 2026-02-06 01:26 COMPLETED: Added kotlin.stdlib.default.dependency=false to gradle.properties. Build verified successful without warnings. Fix committed to branch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice how the orchestrator (Opus) had to resume twice - once after the first dispatch seemingly stalled, and again when it found the worktree had no commits. This is the kind of recovery that happens automatically. The implementer (Sonnet) then picked up the task, did its research, applied the fix, and verified success. All of this is visible in the ticket history without watching terminal output scroll by.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. When it fails 🔗
&lt;/h4&gt;

&lt;p&gt;This workflow isn't perfect, but that's the big reason for the plugin. This whole process keeps you from staring at the chat stream and puts you back into the IDE as you work. If you see progress not being made, or an issue has comments that move it to blocked, you can address it there and then just kick the orchestration again. The goal is: we have boring work we don't wanna do, so we let the robot do it while we act on the interesting parts.&lt;/p&gt;

&lt;p&gt;But sometimes it just hangs; I haven't solved that yet. When this happens we are always recoverable. Claude subprompts have a 10 minute timeout, so even orphaned they will be killed. You just start orchestration again on a clear context and things recover without your attention.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Agentic Patterns: Elements of Reusable Context-Oriented Determinism</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/paulscoder/agentic-patterns-elements-of-reusable-context-oriented-determinism-43k7</link>
      <guid>https://dev.to/paulscoder/agentic-patterns-elements-of-reusable-context-oriented-determinism-43k7</guid>
      <description>&lt;h2&gt;
  
  
  Agentic Patterns: Elements of Reusable Context-Oriented Determinism 🔗
&lt;/h2&gt;

&lt;p&gt;While not as exhaustive as the title might indicate, this is aligned with my focus on enforcing as much determinism as possible from any given LLM. Let's take a look at exploiting tool-using LLMs as a process instead of as a conversation. As I posed in the linked article, many of the failures we experience are related to attention and confusion, which is the progressive noise we introduce as we try to convince the model to perform an action.&lt;/p&gt;

&lt;p&gt;What I describe below are patterns for building &lt;a href="https://developmeh.com/tech-dives/a-deterministic-box-for-non-deterministic-engines/" rel="noopener noreferrer"&gt;A Deterministic Box for Non-Deterministic Engines&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chats are an artifact 🔗
&lt;/h3&gt;

&lt;p&gt;This behavior of progressing the chat with multiple statements toward a solution is merely an artifact of pre-tool-use models. Because we humans needed to handle moving files and integrating code at each step while testing, it became natural to turn interactions into long conversations - ones that eventually degrade into failure loops. Surely someone has told you to just keep clearing your context and starting over.&lt;/p&gt;

&lt;p&gt;Since the arrival of tools like functiongemma, which provides trainable, simple function calling on commodity hardware, we are on the edge of building decision trees for tool-oriented expert systems, but that's a topic for a different day. For now, the models that are effective tool users are too large to be portable, and our contract is still text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduction in variability 🔗
&lt;/h3&gt;

&lt;p&gt;You may recall from math class that you should avoid deriving new values from derived values until you can prove the quality of the procedure, as any instability in accuracy compounds into inaccurate outputs. The same is essentially the normal behavior of long running chats. Since model responses can steer (influence) the decisions of the model later in the same context window, we can fall into a quality trap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()
messages = []

while True:
    # Get input from you
    user_input = input("You: ")

    # Put your message in the context
    messages.append({"role": "user", "content": user_input})

    # Send the context to the LLM
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=messages, # Full history sent each time
    )

    # LLM response
    assistant_message = response.content[0].text

    # LLM response added to context
    messages.append({"role": "assistant", "content": assistant_message})

    print(f"Claude: {assistant_message}\n")

    # Loop

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As this partial example shows, if we send 3 messages there will be 3 responses and our context is 6 messages. Each time we send something new the LLM rereads the entire context, not just the last message, meaning any derived issues that we or the LLM's predictive variability add can pollute the quality of the overall decisions made. There are also unseen patterns, due to how training is compressed, that lead to non-intuitive work and concept replacement for convoluted examples. For example, the LLM will be more accurate if there was more reinforcement during deep learning on that topic, and it fails faster when exercising in novel space. If you wanna understand this better go read this book: &lt;a href="https://sebastianraschka.com/llms-from-scratch/" rel="noopener noreferrer"&gt;https://sebastianraschka.com/llms-from-scratch/&lt;/a&gt;&lt;/p&gt;
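&lt;p&gt;The cost of this re-reading compounds: turn k re-sends k user messages plus k-1 assistant replies, so across n turns the model processes on the order of n squared messages even though you only typed n. A tiny sketch of the arithmetic (my own illustration, not anything from the SDK):&lt;/p&gt;

```python
def messages_processed(turns):
    """Turn k re-sends the whole history: k user messages plus k-1 assistant
    replies, i.e. 2k-1 messages.  Summed over n turns that is n*n - the model
    re-reads quadratically many messages even though you typed only n."""
    return sum(2 * k - 1 for k in range(1, turns + 1))
```

&lt;p&gt;Three turns means nine messages processed; ten turns means a hundred. Every one of those re-read messages is another chance for a derived error to steer the next decision.&lt;/p&gt;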

&lt;h2&gt;
  
  
  KISS the LLM 🔗
&lt;/h2&gt;

&lt;p&gt;So the solution is the same as it ever was. Keep things small and focused on a single task. The LLM isn't a person and doesn't think; we are using human language to steer outputs the same way we write function signatures: to supply enough context to downstream operations. That doesn't mean we never chat with the bot. It does have a big context window, and we can take advantage of that for specific patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plan-Then-Execute Pattern 🔗
&lt;/h2&gt;

&lt;p&gt;As discussed, we want to keep context focused when we need a long running session. This is the key to plan-then-execute. Coding agents' system prompts are biased towards creating implementations, which, like an eager intern, means they jump the gun and start building before understanding. When this happens we find ourselves immediately refactoring the wrong idea. The context becomes polluted with examples of the wrong solution, which leads to lower quality outputs.&lt;/p&gt;

&lt;p&gt;While some coding agents have a "planning" mode, this is a system prompt hack to try to keep the model from producing, and I'll admit I have had less luck with it because it funnels you towards implementation faster. The solution here is to work with the agent's bias to produce and have it produce research artifacts. It will gladly deep dive into a code-base and provide elegant descriptions of architecture and sequence. This is best performed with a reasoning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kill-Then-Breakdown Pattern 🔗
&lt;/h2&gt;

&lt;p&gt;A sub-step of plan-then-execute requires context flushing. After we have verified the quality of the research, we start fresh and have the next agent, preferably a reasoning model, read that research document, and we instruct it to break the work down into tasks and provide a planned implementation for each task. Once again we are working with the model's goal of writing code or producing files, and we get small snippets of code associated with each task. The plan and breakdown is a token-heavy portion of the work stream, but since we keep check-pointing with artifacts written in markdown, the value is retained in a repeatable form. That said, context size does play a part in cost, so flushing the context and loading a compacted version of the topic ends up saving some cost over letting the context explode and risking loss during compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Execute (Mr. Meeseeks Pattern) 🔗
&lt;/h2&gt;

&lt;p&gt;While you can use a single reasoning model to go through each task it has broken down and implement it, there is a better way. We can ask the reasoning model to act as an orchestrator that spawns sub-agents of cheaper models for each task, in the Mr. Meeseeks pattern. The reasoning model starts a simpler model, passes it the task and expected implementation we just broke down, and that model goes to work on it. For simplicity's sake, don't run these operations in parallel yet, not without some considerations to keep multiple agents from overwriting each other. As each task is marked completed the sub-agent is killed and a new one with a fresh context is started.&lt;/p&gt;
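&lt;p&gt;A minimal sketch of that orchestration loop, with a plain subprocess standing in for the cheaper model (the worker contract and names are my assumptions, not a real SDK):&lt;/p&gt;

```python
import subprocess
import sys

def orchestrate(tasks, worker_cmd):
    """Mr. Meeseeks pattern: one fresh process per task, killed when the task
    completes.  Sequential on purpose - parallel workers need coordination so
    they don't overwrite each other's changes."""
    results = {}
    for task_id, spec in tasks:
        # Each worker starts with an empty context and sees only its own task.
        proc = subprocess.run(worker_cmd + [spec], capture_output=True, text=True)
        results[task_id] = (proc.returncode == 0)
    return results

# Stand-in "worker" for demonstration: always succeeds.
ok_worker = [sys.executable, "-c", "import sys; sys.exit(0)"]
```

&lt;p&gt;Each iteration is a fresh context, which is the whole point: a worker can never be steered by the previous task's noise.&lt;/p&gt;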

&lt;p&gt;It's important to remember that the orchestrator gets the output from the sub-agents, so if your development environment produces a lot of noise, or if your agents aren't clamping their read size, you may run into interesting scenarios where you overflow the coding agent's memory. The solution here is to run each sub-agent in a new process instead of as a task run by the first coding agent. I am sure you can see how this can expand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specification-Driven Agent Development Pattern 🔗
&lt;/h2&gt;

&lt;p&gt;Is what we just accomplished. During planning we created a specification from an existing code-base or a set of discussions. Then we captured focused implementation details. Then we did a bunch of tiny implementations. While the more formal flavor of spec-driven development usually stops at the original manifest of "what is this feature going to be", we should take it one step further and actually store partial facts about implementation. On the consumption side, the coder agent will be somewhat literal with what it was given, but it still has to perform integration, write tests, and ensure the work fits into existing tests and functionality.&lt;/p&gt;

&lt;p&gt;Also, given we have this spec, we can add extra steps to our workflow. The orchestrator ends up following a very simple workflow, and its focus is retained around the same document of compressed knowledge it wrote. This is an important fact because models are distinctively expressive - they talk a certain way - which means a model reading what it wrote is less ambiguous than it reading what you wrote. It has enforced patterns from training we can reactivate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Verifier Pattern (Code Review) 🔗
&lt;/h2&gt;

&lt;p&gt;Since we have all these concrete artifacts regarding code, spec, and final implementation, we can then, as our last step, ask a small simple agent to just give us a thumbs up or down on it - essentially a code reviewer. Before we determine something is done, we allow a new context to observe just the changes and the spec; if it rejects, we spawn a new implementer to try again. Then we spawn a new reviewer and review again until it passes.&lt;/p&gt;
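&lt;p&gt;The loop reduces to something like this sketch, where &lt;code&gt;implement&lt;/code&gt; and &lt;code&gt;review&lt;/code&gt; are hypothetical placeholders standing in for spawning fresh agent contexts:&lt;/p&gt;

```python
def implement_and_review(spec, implement, review, max_rounds=3):
    """Agent Verifier pattern: a fresh implementer produces a change; a fresh
    reviewer sees only the change plus the spec and votes up or down.  A
    rejection spawns another implementer, until approval or we give up."""
    for attempt in range(1, max_rounds + 1):
        change = implement(spec, attempt)   # fresh implementer context
        if review(spec, change):            # fresh reviewer context
            return change
    return None                             # escalate to a human
```

&lt;p&gt;Bounding the rounds matters: an endless implement-review ping-pong is exactly the failure loop we flushed our context to escape.&lt;/p&gt;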

&lt;p&gt;In practice this interaction looks a little something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1oei1eze946o3oxanzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1oei1eze946o3oxanzy.png" alt="Agent verifier pattern in practice" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Near the end of this task the image to be added wasn't correct and the reviewer failed it, causing it to loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Writer Agent Pattern 🔗
&lt;/h2&gt;

&lt;p&gt;So this doesn't work out of the box, but it's pretty easy to mock up. The next step is to codify what we send during each phase of execution. For this to work we need to be very explicit. Even though the orchestrator knows the workflow, it may forget as it handles agent spawning, which leads to the workflow rules not being transferred to the sub-agents. We are in a derived-value degradation problem again.&lt;/p&gt;

&lt;p&gt;We have to help the orchestrator by providing it a template of the actions we want each sub-agent to take. It can fill in the gaps with the task. So before it spawns an agent, it reviews the workflow and writes the sub-agent prompt to a file. It then tells the sub-agent to read the file and implement. This provides two benefits to accuracy. Since the orchestrating agent has to keep re-reading the template, a la RE2 (Read and Re-read prompting), it retains more attention because the template keeps getting repeated in the context. And since it writes the refined prompt for the agent to a file, if we crash or the context collapses we can immediately recover by reviewing overall task progress and the presence of the prompt files. It is in fact highly durable, allowing multiple orchestrations concurrently if you have the money.&lt;/p&gt;
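&lt;p&gt;A sketch of the prompt-writer step, assuming a hypothetical template and file naming - nothing here is the real tool's format:&lt;/p&gt;

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template - the real workflow rules would be much richer.
SUBAGENT_TEMPLATE = Template(
    "You are an implementer.\n"
    "Task: $task\n"
    "Rules: comment on the ticket, commit to branch $branch, stop when done.\n"
)

def write_subagent_prompt(task_id, task, out_dir=Path(tempfile.gettempdir())):
    """Re-read the template, fill in the task, and persist the prompt.  The
    surviving file is the recovery point if the orchestrator crashes."""
    path = Path(out_dir) / f"{task_id}.prompt.md"
    path.write_text(SUBAGENT_TEMPLATE.substitute(task=task, branch=task_id))
    return path
```

&lt;p&gt;Naming the file after the task is what makes recovery trivial: the presence of a prompt file tells a restarted orchestrator exactly where it left off.&lt;/p&gt;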

&lt;p&gt;Additionally, the reviewer gets its own prompt written, but it can review the coder's prompt when checking for spec compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Practice 🔗
&lt;/h3&gt;

&lt;p&gt;If I align this to Anthropic models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestrator -&amp;gt; Opus (Reasoning)&lt;/li&gt;
&lt;li&gt;Implementer -&amp;gt; Sonnet (Competent)&lt;/li&gt;
&lt;li&gt;Reviewer -&amp;gt; Haiku (Simple)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also don't rely as much on markdown files past the very first phase of planning. I move all context, with the exception of sub-agent prompts, into a graph. For that graph I use beads, which, while it has its flaws, enables an approach I call the "Context Graph Pattern", which I will go into in a bit.&lt;/p&gt;

&lt;p&gt;What beads essentially is, is Jira or Linear but with outputs that work better for LLMs. Essentially a command line tool whose help dialog outputs markdown instructions, which improves comprehension by the LLM. It's a graph because, like any issue tracker, issues can form chains and comments.&lt;/p&gt;

&lt;p&gt;The comment stream in the picture above is from a plugin I use to interact with my graph visually. It permits me to leave comments for the agents or even rewrite a spec on the fly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Graph Pattern 🔗
&lt;/h2&gt;

&lt;p&gt;Using a tool that allows me to commit context as a focused structure means I get reproducibility and an audit log. Since beads uses issue IDs as commit names the graph extends into the git history. Code and spec and decision tree can all be one artifact without reading all the files. This keeps our context as tight as possible.&lt;/p&gt;

&lt;p&gt;Because the graph is mutable if the first attempt was a complete failure I have two choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide feedback as a refinement and retry -&amp;gt; Refactor&lt;/li&gt;
&lt;li&gt;Rewrite the spec and have the agent pull the previous changes and start over -&amp;gt; Rewrite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can continue to iterate this way at a much lower time cost to me as a developer and since the graph is also able to be committed to a repo and shared with other developers they can do the same.&lt;/p&gt;

&lt;p&gt;When we enhance a feature we can include the previous changes either by diff review and spec retrieval from the graph or by explicit linking within the graph itself. There is a portion of this structure that lets you act as Product, Project, and Tech Lead for the given outcomes.&lt;/p&gt;

&lt;p&gt;Of course there is no silver bullet; you will end up the developer for some things in the end, don't worry. But when you ask the LLM what it thinks went wrong and get your hands dirty, you are guided by a concrete context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 🔗
&lt;/h2&gt;

&lt;p&gt;If you wanna see a functional example of this process, I have been dog-fooding it for a bit, and all the artifacts from the plugin I posted a picture of are over here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.sr.ht/%7Eninjapanzer/jetbrains-beads-manager" rel="noopener noreferrer"&gt;https://git.sr.ht/~ninjapanzer/jetbrains-beads-manager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install: &lt;a href="https://plugins.jetbrains.com/plugin/30089-beads-manager" rel="noopener noreferrer"&gt;https://plugins.jetbrains.com/plugin/30089-beads-manager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A majority of this code was written in my absence in an execution loop. This usually gets you about 80% there. I then spend some time filing bug tickets, adding clarifications and refinements. There were only 3 actual chat sessions during integration, where I provided some focused behavioral examples and some bulk documentation, from which it built some new tasks and orchestrated them.&lt;/p&gt;

&lt;p&gt;I would call this a mature alpha, as it was produced in 1 sitting. Functionality is usable enough that I finished the development using only the plugin. But this isn't just showing off. If you pull this down and have beads installed you can see my prompts and what an actual context graph looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The point 🔗
&lt;/h2&gt;

&lt;p&gt;Is not to replace humans as the engineers but to replace the grunt work. That said, the pattern is implied by the use case. If I am building silly tools for myself, who cares what the code really looks like. If I am building functionality I have to rely on, I need to put considerably more agency into the matter. I will still offload the grunt work when possible, but it is still a practice. I would hope my carpenter would cut a few less corners on my cabinets than their own. It's not that we are lazy; it's that we exercise our agency in a way that is comfortable for us. What we build for others must be of the highest quality; what we build for ourselves needs to meet the need.&lt;/p&gt;

&lt;p&gt;I mean, who knows what will happen, and whether greed will win and our work will be meat-based robot pooper-scoopers. Until everyone figures it out, get more work done and take a few more coffee breaks.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Rust Dancing ANSI Banana with Server-Sent Events</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/paulscoder/rust-dancing-ansi-banana-with-server-sent-events-1nnd</link>
      <guid>https://dev.to/paulscoder/rust-dancing-ansi-banana-with-server-sent-events-1nnd</guid>
      <description>&lt;p&gt;&lt;strong&gt;Remember that dancing Ruby banana?&lt;/strong&gt; 🍌&lt;/p&gt;

&lt;p&gt;Well, I couldn't help myself. After building the &lt;a href="https://dev.to/i-made-a-thing/ruby-streaming-banana-dancer/"&gt;Ruby version with chunked transfer encoding&lt;/a&gt;, I started wondering: what if we explored the &lt;em&gt;other&lt;/em&gt; way to stream data to browsers and terminals? Enter the Rust implementation using Server-Sent Events.&lt;/p&gt;

&lt;p&gt;Yeah, I rewrote it in Rust. With SSE.&lt;/p&gt;

&lt;p&gt;So here's the thing: when you want to stream data from a server to clients, you've got options. My Ruby version uses chunked transfer encoding—basically HTTP/1.1's way of saying "I'm sending you data in pieces, and I'll tell you when each piece ends." But there's another player in town: Server-Sent Events (SSE), which is a proper protocol built on top of chunked encoding for one-way server-to-client streaming.&lt;/p&gt;

&lt;p&gt;Why both? Because understanding the difference matters when you're building real streaming applications. Plus, Rust's async ecosystem with Actix-Web makes SSE implementation surprisingly elegant.&lt;/p&gt;

&lt;p&gt;The best part? It works with both curl &lt;em&gt;and&lt;/em&gt; web browsers. Same endpoint, different experiences. Curl gets raw ANSI animations, browsers get properly formatted SSE streams. One server, two clients, zero compromise.&lt;/p&gt;

&lt;p&gt;Want to see how SSE differs from plain chunked encoding? Grab the code at &lt;a href="https://git.sr.ht/~ninjapanzer/sse-dancing-banana" rel="noopener noreferrer"&gt;sse-dancing-banana&lt;/a&gt; and follow along. Or if you just want to see a banana dance: &lt;code&gt;curl -N http://localhost:8080/live&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Bottom line: Sometimes the best way to learn a protocol is to make something completely silly with it. And what's sillier than making fruit dance in your terminal?&lt;/p&gt;




&lt;p&gt;Hope your terminal's ready for some Rust-powered dancing! 🍌🦀🎵&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq389t143dyoqe6ykm3w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq389t143dyoqe6ykm3w.gif" alt="streaming-banana" width="600" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DevLog 🔗
&lt;/h2&gt;

&lt;h3&gt;
  
  
  02 02 2026 🔗
&lt;/h3&gt;

&lt;h4&gt;
  
  
  SSE vs Chunked Encoding: What's the Difference? 🔗
&lt;/h4&gt;

&lt;p&gt;When I built the Ruby version, I used chunked transfer encoding directly. It's HTTP/1.1's mechanism for streaming—you send data in chunks, each prefixed with its size in hex, terminated by a zero-length chunk. Simple, direct, low-level.&lt;/p&gt;
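&lt;p&gt;The wire format is simple enough to sketch in a few lines of Python (illustrative only; any HTTP/1.1 server library does this framing for you):&lt;/p&gt;

```python
def chunk(payload):
    """One HTTP/1.1 chunk: hex size, CRLF, payload, CRLF."""
    return b"%X\r\n%s\r\n" % (len(payload), payload)

def chunked_stream(frames):
    """A full body is chunks followed by the zero-length terminator chunk."""
    return b"".join(chunk(f) for f in frames) + b"0\r\n\r\n"
```

&lt;p&gt;So a six-byte frame goes over the wire as &lt;code&gt;6\r\nbanana\r\n&lt;/code&gt;, and the stream ends with &lt;code&gt;0\r\n\r\n&lt;/code&gt;. That's all the structure chunked encoding gives you.&lt;/p&gt;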

&lt;p&gt;But SSE is different. It's a &lt;em&gt;protocol&lt;/em&gt; built on top of chunked encoding. Think of chunked encoding as the delivery truck, and SSE as the carefully labeled packages inside. SSE defines a specific text format for events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: &amp;lt;your content here&amp;gt;
data: &amp;lt;more content&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event ends with a double newline. You can have multi-line data (prefix each line with &lt;code&gt;data:&lt;/code&gt;), event types, IDs for reconnection, even retry hints. It's structured, and browsers have native &lt;code&gt;EventSource&lt;/code&gt; API support.&lt;/p&gt;

&lt;p&gt;Here's how the Rust code handles both in the same endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async fn live(req: HttpRequest) -&amp;gt; impl Responder {
    let user_agent = req
        .headers()
        .get("User-Agent")
        .and_then(|h| h.to_str().ok())
        .unwrap_or("");

    let is_curl = user_agent.contains("curl");

    // ... speed parameter parsing ...

    let stream = stream::unfold(
        FrameStream { current: 0, interval, is_curl },
        move |mut state| async move {
            actix_web::rt::time::sleep(state.interval).await;
            if state.current &amp;gt;= FRAMES.len() {
                state.current = 0;
            }
            let frame = FRAMES[state.current];
            let data = state.format_frame_data(frame);
            state.current += 1;
            Some((
                Ok::&amp;lt;_, std::convert::Infallible&amp;gt;(web::Bytes::from(data)),
                state,
            ))
        },
    );

    HttpResponse::Ok()
        .content_type("text/event-stream")
        .streaming(stream)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic happens in &lt;code&gt;format_frame_data&lt;/code&gt;. For curl, we send raw ANSI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fn format_frame_data(&amp;amp;self, frame: &amp;amp;str) -&amp;gt; String {
    if self.is_curl {
        // Chunked encoding: just send the frame with ANSI clear codes
        format!("{}{}\n\n", ANSI_CLEAR, frame)
    } else {
        // SSE: format according to the SSE protocol
        let cleaned = self.strip_ansi(frame);
        let lines: Vec&amp;lt;&amp;amp;str&amp;gt; = cleaned.lines().collect();
        let sse_lines: Vec&amp;lt;String&amp;gt; = lines
            .iter()
            .map(|l| format!("data: {}", l))
            .collect();
        format!("{}\n\n", sse_lines.join("\n"))
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the difference? For curl, we're just sending data. For browsers, we're wrapping each line in &lt;code&gt;data:&lt;/code&gt; prefixes and preserving the SSE format. The browser's &lt;code&gt;EventSource&lt;/code&gt; API automatically parses this.&lt;/p&gt;
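&lt;p&gt;For clients without &lt;code&gt;EventSource&lt;/code&gt;, the parsing side is small too. A minimal Python sketch of the accumulate-until-blank-line rule (simplified: it ignores &lt;code&gt;event:&lt;/code&gt;, &lt;code&gt;id:&lt;/code&gt;, and &lt;code&gt;retry:&lt;/code&gt; fields):&lt;/p&gt;

```python
def parse_sse(stream_text):
    """Minimal SSE parse: consecutive 'data:' lines accumulate, and a blank
    line dispatches one event with the lines joined by newlines - the same
    rule EventSource applies in the browser."""
    events, data = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:
            events.append("\n".join(data))
            data = []
    return events
```

&lt;p&gt;Feed it the multi-line &lt;code&gt;data:&lt;/code&gt; frames the server emits and each banana frame comes back out as one event.&lt;/p&gt;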

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reconnection&lt;/strong&gt; : SSE includes automatic reconnection with &lt;code&gt;Last-Event-ID&lt;/code&gt;. Chunked encoding? You're on your own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser Support&lt;/strong&gt; : &lt;code&gt;EventSource&lt;/code&gt; is built-in. Chunked encoding requires manual &lt;code&gt;fetch()&lt;/code&gt; streaming, which is newer and less supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Types&lt;/strong&gt; : SSE lets you send different event types on the same stream. Chunked encoding is just bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt; : For server-to-client streaming, SSE handles the protocol. Chunked encoding is just the transport.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When to use what?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunked Encoding&lt;/strong&gt; : When you need low-level control, binary data, or don't care about browser niceties. Think raw terminal streaming, like the Ruby version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE&lt;/strong&gt; : When you want browser compatibility, automatic reconnection, structured events, or you're building a real-time notification system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, SSE won because I wanted both curl &lt;em&gt;and&lt;/em&gt; browser support without writing separate endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  02 02 2026 1 🔗
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Rust's Async Streams: The Good Parts 🔗
&lt;/h4&gt;

&lt;p&gt;Coming from Ruby's Sinatra with its simple &lt;code&gt;stream&lt;/code&gt; block, I expected Rust to be painful. It wasn't.&lt;/p&gt;

&lt;p&gt;Actix-Web's streaming response is built on Rust's &lt;code&gt;Stream&lt;/code&gt; trait, which is like an async iterator. You create something that implements &lt;code&gt;Stream&lt;/code&gt;, and the framework handles the rest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct FrameStream {
    current: usize,
    interval: Duration,
    is_curl: bool,
}

impl Stream for FrameStream {
    type Item = Result&amp;lt;web::Bytes, std::convert::Infallible&amp;gt;;

    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, _cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;)
        -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt;
    {
        if self.current &amp;gt;= FRAMES.len() {
            self.current = 0;
        }
        let frame = FRAMES[self.current];
        let data = self.format_frame_data(frame);
        self.current += 1;
        Poll::Ready(Some(Ok(web::Bytes::from(data))))
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But I took a shortcut. Instead of implementing &lt;code&gt;Stream&lt;/code&gt; manually, I used &lt;code&gt;stream::unfold&lt;/code&gt;, which is like &lt;code&gt;reduce&lt;/code&gt; but for streams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let stream = stream::unfold(
    FrameStream { current: 0, interval, is_curl },
    move |mut state| async move {
        actix_web::rt::time::sleep(state.interval).await;
        // ... produce next item ...
        Some((Ok(web::Bytes::from(data)), state))
    },
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state (&lt;code&gt;FrameStream&lt;/code&gt;) gets passed into the async block, which produces the next item and returns the updated state. Rinse, repeat, stream forever. It's elegant once you get past the types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Rust Tax&lt;/strong&gt; : You pay upfront in type signatures (&lt;code&gt;Result&amp;lt;web::Bytes, std::convert::Infallible&amp;gt;&lt;/code&gt; for an infallible stream?), but you get safety and zero-cost abstractions. No runtime overhead for this streaming abstraction—it compiles down to a state machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ruby Comparison&lt;/strong&gt;: In Ruby's Sinatra, I did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stream(:keep_open) do |out|
  loop do
    out &amp;lt;&amp;lt; render_frame
    sleep 0.1
  end
end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but you're managing the loop and sleep manually. Rust's &lt;code&gt;stream::unfold&lt;/code&gt; encodes that pattern into the type system. More verbose, but impossible to accidentally block the runtime or leak resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  01 02 2026 🔗
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Compile-Time Frame Embedding 🔗
&lt;/h4&gt;

&lt;p&gt;One detail I'm proud of: the frames are embedded at compile time using &lt;code&gt;include_str!&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const FRAMES: [&amp;amp;str; 8] = [
    include_str!("../../assets/frames/frame0.txt"),
    include_str!("../../assets/frames/frame1.txt"),
    include_str!("../../assets/frames/frame2.txt"),
    include_str!("../../assets/frames/frame3.txt"),
    include_str!("../../assets/frames/frame4.txt"),
    include_str!("../../assets/frames/frame5.txt"),
    include_str!("../../assets/frames/frame6.txt"),
    include_str!("../../assets/frames/frame7.txt"),
];

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No runtime file I/O. No error handling for missing files in production. The frames are literally part of the compiled binary, stored in the &lt;code&gt;.rodata&lt;/code&gt; section. If the files don't exist at compile time, the build fails. Hard fail at compile time beats mysterious runtime errors.&lt;/p&gt;

&lt;p&gt;In Ruby, I loaded frames at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frames = Dir.glob("ascii_frames/*.txt").sort.map { |f| File.read(f) }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it adds runtime overhead, risks I/O errors, and requires the filesystem to be available. For a simple animation, compile-time embedding is perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off&lt;/strong&gt;: Binary size grows by the contents of ~8 small text files. For a banana animation, I'll take it.&lt;/p&gt;

&lt;h3&gt;
  
  
  01 02 2026 1 🔗
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Nix for Rust: Less Painful Than Ruby 🔗
&lt;/h4&gt;

&lt;p&gt;After fighting Nix for the Ruby version's gem dependencies, Rust was refreshing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;outputs = { self, nixpkgs, ... }:
  let
    system = "x86_64-linux";
    pkgs = import nixpkgs { inherit system; };
  in {
    devShells.${system}.default = pkgs.mkShell {
      buildInputs = with pkgs; [
        rustc
        cargo
        rust-analyzer
      ];
    };
  };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Cargo handles dependencies via &lt;code&gt;Cargo.lock&lt;/code&gt;, which Nix respects. No gemset.nix translation layer, no bundlerEnv complexity. Rust's deterministic builds align perfectly with Nix's philosophy.&lt;/p&gt;

&lt;p&gt;For production, I'd add &lt;code&gt;pkgs.rustPlatform.buildRustPackage&lt;/code&gt;, but for local dev? This simple shell is all you need.&lt;/p&gt;

&lt;p&gt;The Rust ecosystem's commitment to reproducible builds (via Cargo.lock) makes Nix integration almost trivial. Ruby's dynamic nature fights Nix at every turn. This is one of those moments where Rust's compile-time philosophy pays dividends.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>rust</category>
      <category>showdev</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Let's get AI to finish its work</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/paulscoder/a-deterministic-box-for-non-deterministic-engines-55f5</link>
      <guid>https://dev.to/paulscoder/a-deterministic-box-for-non-deterministic-engines-55f5</guid>
      <description>&lt;h2&gt;
  
  
  The Nature of Non-Determinism with LLMs 🔗
&lt;/h2&gt;

&lt;p&gt;So you may have heard of weights, biases, and temperature when LLMs are described. For the uninitiated: weights and biases are the core parameters learned during training that encode the model's knowledge, while temperature is an inference-time parameter that controls how much variance appears in the model's outputs. Higher temperature means more randomness in token selection; lower temperature means more deterministic responses. It's exactly this temperature parameter that ensures the model will respond with some variance to the same input. That variance flies in the face of our normal expectations of computers, but it also provides some of the nuance in token prediction that makes the LLM work, so it's easier to see it as an &lt;strong&gt;Architectural Trade-Off&lt;/strong&gt; than a &lt;strong&gt;Detractor&lt;/strong&gt;. With that grounding, let's talk about how to make good use of this engine of... making shit up.&lt;/p&gt;
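
&lt;p&gt;To make the temperature knob concrete, here is a toy sketch (illustrative numbers only, not any real model's logits) of how temperature rescales candidate-token scores before the softmax that drives selection:&lt;/p&gt;

```rust
// Toy illustration: temperature divides the logits before softmax.
// Low temperature sharpens the distribution toward the top token;
// high temperature flattens it toward uniform.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    // Subtract the max for numerical stability before exponentiating.
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.5]; // made-up scores for three candidate tokens
    let cold = softmax_with_temperature(&logits, 0.1);
    let hot = softmax_with_temperature(&logits, 10.0);
    // Near-greedy at low temperature, near-uniform at high temperature.
    assert!(cold[0] > 0.99);
    assert!(hot[0] < 0.40);
    println!("cold: {cold:?}\nhot: {hot:?}");
}
```

&lt;p&gt;Same weights, very different behavior: at 0.1 the top token takes essentially all the probability mass, at 10 the three candidates are nearly uniform.&lt;/p&gt;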

&lt;h3&gt;
  
  
  Making Shit Up 🔗
&lt;/h3&gt;

&lt;p&gt;Yep, so that's not a tradeoff, it's a flaw, one we haven't solved yet. When the context is ambiguous, the model does one of two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Just pretend it didn't hear what it was asked to do&lt;/li&gt;
&lt;li&gt;Make shit up, hallucinations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The former isn't talked about as much as the hallucinations, and hallucinations are harder to detect and protect against. Note that hallucinations are actually a separate problem from non-determinism - they're about confidence miscalibration and training data limitations, not temperature variance - and they can occur even at low temperature settings. But we can take a stab at them with some extra prompting and extra runtimes, at the cost of tokens. Don't get too upset; this is just the normal way of computers: we make a simple thing, it has sharp edges, so we make more things that consume some extra energy to constrain the first.&lt;/p&gt;

&lt;p&gt;Usually these extra layers exist to compensate for the inefficiency of human communication, but sometimes it's just because people want to abuse the system. I like to think of auth as a regular pain point we don't really need but have to have because trust is a hard problem: most of what's on the web doesn't need centralized authentication, but GPG has always been too hard, so we made something easier to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do? 🔗
&lt;/h2&gt;

&lt;p&gt;Ok, back to the question. I call it micromanagement, but that kind of implies that the model and its agents have some kind of human agency, which they don't. Still, some of their processes are directly modeled after humans, so we can loosely apply some of the same techniques to rein them in.&lt;/p&gt;

&lt;p&gt;First, let's talk about context and ambiguity. If you haven't figured this out yet: the longer the context, the more the model's attention distributes across tokens, reducing precision on individual details - a "lost-in-the-middle" effect where information gets deprioritized. Much of this is your fault, because even with your best effort you introduce inconsistencies and other inaccuracies into the conversation. The lesson: clear your context often, especially between phases of your work - planning, building, and verifying. I like to think of this as an analogy to writing and editing: have someone else edit your work, or write it and review it a week later, to improve objectivity. Thankfully, with LLMs, memory is as ephemeral as you like.&lt;/p&gt;

&lt;p&gt;So we need a way to turn a goal into a workstream that lets us actually look away from the model's stream. Some might call this agentic orchestration, but those systems often sprint from meaningful to overly complicated in a matter of weeks - especially since, if you use something like Claude-Code, Codex, or OpenCode, all the building tools are there already. So, starting from something like Claude-Code, we need to teach our main agent interface to better follow some process when working.&lt;/p&gt;

&lt;p&gt;Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.MD&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Working Style

When collaborating on this project:
- Check existing files first before suggesting changes
- Ask questions one at a time to refine ideas
- Prefer multiple choice questions when possible
- Focus on understanding: purpose, constraints, success criteria
- Apply YAGNI ruthlessly - remove unnecessary features from all designs
- Present designs in sections and validate each incrementally
- Go back and clarify when something doesn't make sense

## Deliverables

- Break down the decisions from collaboration into tasks
- You must use any defined task tracking tools outlined in the Task Tracking section to create tasks falling back to markdown files if nothing is defined
- Create a report for the execution plan with dependencies mapped

## Workflow Guidelines

- Create an epic for each high-level objective
- Create subtasks as a todo chain under the epic
- Write titles as the task to be performed (imperative form)
- Add detailed descriptions with examples of work to be done
- Verify each task before closing
- Log details about failures and retries in ticket descriptions for historical tracking
- When an epic is completed, write a report of the task graph and verify all items were performed

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Controlling Core Memories 🔗
&lt;/h3&gt;

&lt;p&gt;As the &lt;em&gt;Deliverables&lt;/em&gt; and &lt;em&gt;Workflow Guidelines&lt;/em&gt; sections above suggest, we want the first pass to focus on work breakdown and dependencies. This also pays off in how we track progress on that work. The agent writing code often falls victim to the two failure modes above, with a couple of variations: hallucinations here show up as items that just don't work, and the remainder is missed features. That's actually useful, because we can track and later interrogate the success and failure of the model's execution. Better yet, we can finally realize the age-old dream of repeating a variation of a task more accurately in the future, because each replanning is less ambiguous. Good luck doing that with people, but with LLMs it's all data.&lt;/p&gt;

&lt;p&gt;So memory management moves into tasks, which can live in markdown, in Jira via MCP (Model Context Protocol - a standard for connecting AI agents to external tools), or in my preference, &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;Beads&lt;/a&gt;. I don't think there are many meaningful differences between them for me, except when we come back to large context sizes introducing confusion.&lt;/p&gt;

&lt;p&gt;So Beads does for AI what Jira does for humans, and yet even as a human I would rather use Beads than Jira. Arguably, the difference is that tools like Beads focus on de-complicating the organization of work: it's there for the worker's benefit. Jira, on the other hand, only benefits the bean counters, and the workers just have to suffer so that a very few can complain that the reports it produces are useless.&lt;/p&gt;

&lt;p&gt;Sorry, my Jira PTSD is showing... Beads. Right, Beads lets the coding agent take its task breakdown and put it into a graph with dependencies and epics. These feel meaningless to the agent, but it's more about what we get to do with them later. It's easier for me to say to a fresh context: review epic X and verify its functionality. You'll notice that when it finds a failure it usually just tries to fix it, but it also records a stream of attempts and the final resolution. The result is a history of the model's confusion, whether introduced by me or by the plan; and when I want to do something similar later, I can use the JSONL (JSON Lines format - one JSON object per line) from the beads sync operation to prompt a variation of the task and create a new task breakdown.&lt;/p&gt;
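
&lt;p&gt;As a sketch of what consuming that export can look like - the field names below are hypothetical, not the real Beads schema, and a real consumer would use a proper JSON parser rather than substring matching:&lt;/p&gt;

```rust
// Hypothetical sketch: each line of a JSONL export is one task record.
// The "status"/"closed" field names are assumptions for illustration.
fn closed_task_lines(jsonl: &str) -> Vec<&str> {
    jsonl
        .lines()
        .filter(|line| line.contains("\"status\":\"closed\""))
        .collect()
}

fn main() {
    let export = concat!(
        "{\"id\":\"epic-1\",\"status\":\"closed\"}\n",
        "{\"id\":\"epic-1.1\",\"status\":\"open\"}\n",
        "{\"id\":\"epic-1.2\",\"status\":\"closed\"}\n",
    );
    let closed = closed_task_lines(export);
    assert_eq!(closed.len(), 2);
    println!("{closed:?}");
}
```

&lt;p&gt;From there, the closed tasks become seed context for the next planning pass.&lt;/p&gt;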

&lt;p&gt;Here is a partial CLAUDE.md that explains Beads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### Task tracking

Use 'bd' (beads) for task tracking. Run `bd onboard` to get started.

#### bd Quick Reference

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Discovery &amp;amp; Navigation
bd ready # Find available work
bd show &amp;lt;id&amp;gt; # View issue details
bd show &amp;lt;id&amp;gt; --children # Show issue with subtasks

# Task Management
bd create "&amp;lt;title&amp;gt;" --type epic # Create an epic
bd create "&amp;lt;title&amp;gt;" --parent &amp;lt;id&amp;gt; # Create subtask under parent
bd update &amp;lt;id&amp;gt; --description "..." # Update description
bd update &amp;lt;id&amp;gt; --status in_progress # Claim work
bd close &amp;lt;id&amp;gt; # Complete work

# Sync &amp;amp; Persistence
bd sync # Sync with git (exports to JSONL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
#### Workflow Guidelines

- Create an epic for each high-level objective
- Create subtasks as a todo chain under the epic
- Write titles as the task to be performed (imperative form)
- Add detailed descriptions with examples of work to be done
- Verify each task before closing
- Log details about failures and retries in ticket descriptions for historical tracking
- When an epic is completed, write a report of the task graph and verify all items were performed

#### Displaying Task Graphs

Use `bd show &amp;lt;epic-id&amp;gt; --children` to display the task hierarchy. For visual reports, create ASCII diagrams showing task dependencies and completion status.

&lt;/code&gt;&lt;/pre&gt;



&lt;h2&gt;
  
  
  Uniqueness vs Repeatability 🔗
&lt;/h2&gt;

&lt;p&gt;This is kind of the funny part of this whole process: the LLM can help with a bespoke task, but it doesn't generally improve performance, because the context tends to bias towards failures and you end up having to check its outputs and re-validate anything ambiguous. You may say that you don't need to, but just look at the news: this is the failure mode AI tools get lambasted for. Of course, being engineers, we know that everything is essentially wrong and we are balancing the amount of wrong we can accept at any given moment.&lt;/p&gt;

&lt;p&gt;This of course means that when we find a process refinable to a predictable set of tasks, we traditionally end up building some complicated, brittle script to automate it - which is part of why building things with computers can be kinda dry. Instead, we should let the models handle fixed sets of tasks that need a little flexibility but don't offer too wide a range of opportunities for error.&lt;/p&gt;

&lt;p&gt;Refining a process from memory is just one strategy, but it works quite well: since the next agent can read the actions of its predecessor, you can bias it toward the success path and start looking away, which is the dream. For years I have used LLMs and found myself trapped staring at the console, either because the model prompts me for feedback every couple of seconds while building context or because I have to endlessly remind it to complete its tasks. Both of these conditions are mostly eliminated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying to K8s 🔗
&lt;/h3&gt;

&lt;p&gt;A concrete example of this is deploying an application to Kubernetes. This is super well documented and there is a ton of good tooling, but it's also a highly configurable system: each enterprise makes its own rules and policies around how containers are described. It can be very tiresome reading charts and chart documentation while bashing your head against a status like &lt;em&gt;CrashLoopBackOff&lt;/em&gt;. It's not that there's no way to learn what's happening, but it takes a lot of command orchestration, and the LLM can collect context about the failure much better since it can read multiple streams concurrently. So I recently deployed &lt;a href="https://flagd.dev" rel="noopener noreferrer"&gt;flagd&lt;/a&gt; (an open-source feature flag daemon) to our cluster and let Claude handle the building of charts. To keep things simple, I let it write the deployment and service artifacts itself. It failed a bunch, but eventually determined from the feedback of the k8s MCP that we were missing a config file. It mounted a volume and created a config map; 20 minutes later I had a stable instance running. I will need to do some work to make this fit organizational norms, but I am also ready for my PoC at the same time. And I have a log of what worked and what didn't, so when it comes time to deploy something else to our cluster I have a baseline of tasks that will reduce planning time and token volume.&lt;/p&gt;

&lt;p&gt;An example of this completed task output:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Beads Task Report - January 26, 2026

Epic: Deploy flagd to namespace

ID: feature-flag-prd-q6w | Status: ✅ CLOSED

---
Task Graph

feature-flag-prd-q6w (EPIC) ✅ CLOSED
├── .1 Create flagd Deployment manifest ✅ CLOSED
├── .2 Create flagd Service manifest ✅ CLOSED
├── .3 Deploy flagd Deployment to cluster ✅ CLOSED
├── .4 Deploy flagd Service to cluster ✅ CLOSED
└── .5 Verify flagd deployment is running ✅ CLOSED

---
Task Completion Summary
┌─────┬────────────────────────────┬────────┬───────────────────────────────────────────┐
│ ID │ Task │ Status │ Close Reason │
├─────┼────────────────────────────┼────────┼───────────────────────────────────────────┤
│ .1 │ Create Deployment manifest │ ✅ │ flagd_deploy/k8s/deployment.yaml created │
├─────┼────────────────────────────┼────────┼───────────────────────────────────────────┤
│ .2 │ Create Service manifest │ ✅ │ flagd_deploy/k8s/service.yaml created │
├─────┼────────────────────────────┼────────┼───────────────────────────────────────────┤
│ .3 │ Deploy Deployment │ ✅ │ Deployment exists in cluster │
├─────┼────────────────────────────┼────────┼───────────────────────────────────────────┤
│ .4 │ Deploy Service │ ✅ │ Service exists with ClusterIP 127.0.0.1 │
├─────┼────────────────────────────┼────────┼───────────────────────────────────────────┤
│ .5 │ Verify deployment running │ ✅ │ Pod running after troubleshooting │
└─────┴────────────────────────────┴────────┴───────────────────────────────────────────┘
---
Troubleshooting Log (Task .5)

The verification task encountered issues that required 3 attempts:

Attempt 1 - CrashLoopBackOff
- Container exited with code 0 after printing help
- Fix: Added args: [start] to deployment

Attempt 2 - Fatal: no sync implementation set
- flagd requires a flag source (file, gRPC, or HTTP sync)
- Fix: Created ConfigMap with example flags, added volume mount

Attempt 3 - SUCCESS ✓
- Created ConfigMap: accounts-utility-flagd-service-dev-flags
- Updated deployment with volume mount to /etc/flagd
- Added args: --uri file:/etc/flagd/flags.json

---
Final Deployment State
┌───────────────┬────────────────────────────────────────────────────┐
│ Component │ Value │
├───────────────┼────────────────────────────────────────────────────┤
│ Pod │ flagd-service-dev-87fb5946f-vft4t │
├───────────────┼────────────────────────────────────────────────────┤
│ Status │ 1/1 Running │
├───────────────┼────────────────────────────────────────────────────┤
│ Pod IP │ 10.10.10.102 │
├───────────────┼────────────────────────────────────────────────────┤
│ Service IP │ 172.10.10.10 │
├───────────────┼────────────────────────────────────────────────────┤
│ Node │ node │
├───────────────┼────────────────────────────────────────────────────┤
│ flagd Version │ v0.13.2 │
└───────────────┴────────────────────────────────────────────────────┘
Exposed Ports:
- gRPC: 8013
- HTTP: 8014
- FlagSync: 8015
- OFREP: 8016

---
Result: 5/5 tasks completed - Epic successfully closed

&lt;/code&gt;&lt;/pre&gt;



&lt;h2&gt;
  
  
  What next 🔗
&lt;/h2&gt;

&lt;p&gt;Here is how I would go about it: start by recording your plans. Maybe take one of my examples, refine it for yourself, and check it against your own experience. Then install Beads and just manually create tasks and see how the agent interacts. Then go ahead and automate the whole thing - and maybe this time we can avoid &lt;a href="https://xkcd.com/1319/" rel="noopener noreferrer"&gt;xkcd:1319&lt;/a&gt;, but probably not :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computerscience</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Magic of Stubbing sh</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Fri, 10 Oct 2025 01:34:25 +0000</pubDate>
      <link>https://dev.to/paulscoder/the-magic-of-stubbing-sh-567m</link>
      <guid>https://dev.to/paulscoder/the-magic-of-stubbing-sh-567m</guid>
      <description>&lt;p&gt;I really love sh and bash but I often feel alone and I get some regular negativity when I solve a problem with it. I know why too, shell scripts can have a broad level of complexity that has other languages embedded into it. But its not as esoteric as you might think, more another domain we should be comfortable with. One of the ways I learned to deal with unknown domains was to read the tests. Because tests tend to use some common language they are often more literate. Here's the thing, I keep getting people tell me that shell scripts don't have tests, and they are wrong. See I have this trick, its called BATS and I talked about it over here &lt;a href="https://developmeh.com/tech-dives/test-anything-means-testing-bash" rel="noopener noreferrer"&gt;Test Anything Protocol&lt;/a&gt; where I showed an example of stubbing &lt;code&gt;helm&lt;/code&gt; but that example was not the whole story. Since the BATS framework is itself bash we have all those nasty tools at our disposal to manipulate our subject under test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subject Under Test
&lt;/h2&gt;

&lt;p&gt;Boring as it may be, the purpose here is to observe and verify the output and side-effects of commands run by the shell. We need to respect the boundary between our scripts and the tests for those scripts. One of the challenges is that some commands avoid observation, like &lt;code&gt;rm&lt;/code&gt; and &lt;code&gt;mktemp&lt;/code&gt;: if my script creates a tempfile and then removes it, it's hard to verify that the step occurred without modifying the subject. Of course we can write traces to stderr with &lt;code&gt;echo ... &amp;gt;&amp;amp;2&lt;/code&gt;, but that proves nothing more than the presence of the echo statement. I need to verify the validity of these intermediate steps. In traditional programming languages we have mocks and spies, which capture the fundamental flow of the code by interfering with call sites and through reflection. We can do something similar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mocking or Stubbing... Whatever
&lt;/h2&gt;

&lt;p&gt;Now there are BATS mocking libraries, and they are a wondrous cornucopia of features, but in my experience they don't expose much more than a DSL - a new way of describing how to intercept and modify interactions. So go learn and use those, but for many normal use cases I want to show you how to do this by hand, using the shell language you already know. In the following example we are going to observe tempfiles so we can keep track of intermediate state, while exposing debugging information when doing TDD - more on that down the line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;temp.sh&lt;/strong&gt; Subject Under Test&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash -e&lt;/span&gt;

&lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$workspace&lt;/span&gt;&lt;span class="s2"&gt;/not_temp.sh"&lt;/span&gt;

&lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;second&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"WOW"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$second&lt;/span&gt;

&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$first&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$second&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;temp.sh.bats&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bats&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; +x

bats_require_minimum_version 1.5.0

&lt;span class="c"&gt;# Load Bats libraries&lt;/span&gt;
load ../../.test/bats/bats-support/load
load ../../.test/bats/bats-assert/load

&lt;span class="c"&gt;# Stub rm to capture files deleted&lt;/span&gt;
&lt;span class="k"&gt;function &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;arg &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; -&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tmp/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;.captured"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;return &lt;/span&gt;0
    &lt;span class="k"&gt;fi
  done
  &lt;/span&gt;&lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Stub mktemp to track temp files for cleanup&lt;/span&gt;
&lt;span class="k"&gt;function &lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;tmp
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"-d"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; counter &amp;lt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
    &lt;span class="o"&gt;((&lt;/span&gt;counter++&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="k"&gt;$((&lt;/span&gt;counter&lt;span class="k"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
    &lt;span class="nv"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tmp/bats.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;counter&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tmp&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS&lt;/span&gt;
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tmp&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

setup&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./.tests/res"&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./.tests/res_tmp"&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TEMPS_COUNTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/tmp/.counter
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TEMPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/tmp/.temps
  &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/."&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt;
  &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tmp"&lt;/span&gt;
  &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;mktemp
  export&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;rm

  touch&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
  &lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS&lt;/span&gt;
  &lt;span class="nb"&gt;echo &lt;/span&gt;0 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

teardown&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;tmp &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;temps&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tmp&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;done

  &lt;/span&gt;&lt;span class="nb"&gt;unset&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;mktemp
  unset&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;rm

  command rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMPS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="nb"&gt;unset &lt;/span&gt;TEST_DIRECTORY
  &lt;span class="nb"&gt;unset &lt;/span&gt;TEST_DIRECTORY_RUNNING
  &lt;span class="nb"&gt;unset &lt;/span&gt;TEMPS_COUNTER
  &lt;span class="nb"&gt;unset &lt;/span&gt;TEMPS
&lt;span class="o"&gt;}&lt;/span&gt;

@test &lt;span class="s1"&gt;'test intermediate files'&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;second_tempfile_expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"WOW"&lt;/span&gt;
  run bash ./.tests/temp.sh

  &lt;span class="c"&gt;# note the captured&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;second_tempfile_actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/tmp/bats.2.captured&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  assert_success

  assert_equal &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; 2
  assert_equal &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nv"&gt;$TEST_DIRECTORY_RUNNING&lt;/span&gt;/not_temp.sh &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;0 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;1&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 0
  assert_equal &lt;span class="nv"&gt;$second_tempfile_actual&lt;/span&gt; &lt;span class="c"&gt;#second_tempfile_expected&lt;/span&gt;
  assert_output &lt;span class="nt"&gt;--regexp&lt;/span&gt; &lt;span class="s1"&gt;'Done'&lt;/span&gt;

  &lt;span class="c"&gt;# _Note_ The use of `command` which bypasses our function export of `rm` introduced by `export -f rm` this makes sure we use the original command and not our mock.&lt;/span&gt;
    &lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's explore the mocking. Ignoring the directory paths, we intercept calls to mktemp, and if the command's first argument is &lt;code&gt;-d&lt;/code&gt; (directory) we inject a static location we control. Otherwise we create a unique file in that directory. When we do this, we record the temp file's path and the number of files created so far, so we can verify the interaction later. Both of these bookkeeping files can be observed during execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stub mktemp to track temp files for cleanup&lt;/span&gt;
&lt;span class="k"&gt;function &lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;tmp
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"-d"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; counter &amp;lt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
    &lt;span class="o"&gt;((&lt;/span&gt;counter++&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="k"&gt;$((&lt;/span&gt;counter&lt;span class="k"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;
    &lt;span class="nv"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tmp/bats.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;counter&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tmp&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$TEMPS&lt;/span&gt;
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$tmp&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
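&lt;p&gt;The mock above only works because an exported function shadows the real binary in child bash processes. As a minimal, hypothetical sketch of just that mechanism (the static path here is an assumption for illustration, not the article's counter logic):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# A minimal sketch (not the article's mock) of the mechanism the mocks rely on:
# `export -f` makes a shell function visible to child bash processes, where it
# shadows the real command of the same name.
mktemp() {
  # Hypothetical stand-in; the real mock builds a counter-based path instead.
  echo "/tmp/fake.tmpfile"
}
export -f mktemp

# The child bash now resolves `mktemp` to our function, not the real binary.
bash -c 'mktemp'
```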



&lt;p&gt;When we write clean scripts we also clean up after ourselves, and this good behavior makes it a challenge to check the contents of intermediate files. Because shell scripts are filesystem based, the most common way for data to move between processes is to write to and read from the filesystem. So when we are tracing a bug, we regularly have to interfere with our subject under test just to observe its intermediate steps. But if we capture the &lt;code&gt;rm&lt;/code&gt; command, we can conditionally retain some of that progress. In this example we inspect all the args, and for each one that looks like a path we extract the filename, append &lt;code&gt;.captured&lt;/code&gt;, and copy the file to our running directory. Ultimately, even if we don't stub mktemp, we can still capture deleted tempfiles this way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt; the use of &lt;code&gt;command&lt;/code&gt;, which bypasses our function export of &lt;code&gt;rm&lt;/code&gt; (introduced by &lt;code&gt;export -f rm&lt;/code&gt;) and makes sure we use the original command and not our mock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stub rm to capture files deleted&lt;/span&gt;
&lt;span class="k"&gt;function &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;arg &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; -&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tmp/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$arg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;.captured"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;return &lt;/span&gt;0
    &lt;span class="k"&gt;fi
  done
  &lt;/span&gt;&lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
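&lt;p&gt;To see the &lt;code&gt;command&lt;/code&gt; bypass in isolation, here is a tiny, hypothetical example; the &lt;code&gt;uname&lt;/code&gt; mock is invented for illustration only, chosen because it needs no filesystem state:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# `command` skips shell functions and aliases, resolving straight to the
# builtin or binary. This hypothetical mock of uname shows both paths.
uname() {
  command echo "mocked: uname"
}

uname          # routed through our function, prints the mocked string
command uname  # bypasses the function, runs the real uname binary
```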



&lt;p&gt;Now let's review the test. First, we can write traditional expectations with the assert module, following the standard Given, When, Then structure we love. Pay attention to how the Then is structured too: because this is bash, whichever assertion fails first is where the program exits. So note the last line, where we clean up the temp directory for the test. By leaving cleanup as the last statement, we keep the test artifacts whenever the test fails. That enables better TDD: we write a failing test and iterate until it passes, while the test is also producing trace and debugging information about our work. We can do this with any command, too. Say we call &lt;code&gt;git diff&lt;/code&gt; and want to verify what we produced: we can intercept that command and have it write a file to our test workspace, importantly without changing the subject under test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;@test &lt;span class="s1"&gt;'test intermediate files'&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# Given&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;second_tempfile_expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"WOW"&lt;/span&gt;

  &lt;span class="c"&gt;# When&lt;/span&gt;
  run bash ./.tests/temp.sh

  &lt;span class="c"&gt;# Then&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;second_tempfile_actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/tmp/bats.2.captured&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  assert_success

  assert_equal &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMPS_COUNTER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; 2
  assert_equal &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nv"&gt;$TEST_DIRECTORY_RUNNING&lt;/span&gt;/not_temp.sh &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;0 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;1&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 0
  assert_equal &lt;span class="nv"&gt;$second_tempfile_actual&lt;/span&gt; &lt;span class="c"&gt;#second_tempfile_expected&lt;/span&gt;
  assert_output &lt;span class="nt"&gt;--regexp&lt;/span&gt; &lt;span class="s1"&gt;'Done'&lt;/span&gt;

  &lt;span class="c"&gt;# _Note_ The use of `command` which bypasses our function export of `rm` introduced by `export -f rm` this makes sure we use the original command and not our mock.&lt;/span&gt;
    &lt;span class="nb"&gt;command rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEST_DIRECTORY_RUNNING&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
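&lt;p&gt;As a sketch of that &lt;code&gt;git diff&lt;/code&gt; idea, here is one hedged way the interception might look; the &lt;code&gt;CAPTURE_DIR&lt;/code&gt; name and the &lt;code&gt;.captured&lt;/code&gt; reuse are assumptions in the spirit of the article, not code from it:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical sketch: intercept `git diff` so its output is preserved in the
# test workspace, while every other git subcommand passes through untouched.
CAPTURE_DIR="./captures"   # assumed location, not from the article
mkdir -p "$CAPTURE_DIR"

git() {
  if [[ "$1" == "diff" ]]; then
    # Run the real git, keeping a copy of the diff we can assert on later.
    command git "$@" | tee "$CAPTURE_DIR/git_diff.captured"
  else
    command git "$@"
  fi
}
export -f git
```

The subject under test still just calls &lt;code&gt;git diff&lt;/code&gt;; only the test harness knows a capture file appears alongside it.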



&lt;h2&gt;
  
  
  Just Test Things and Be Happy
&lt;/h2&gt;

&lt;p&gt;This is just one dumb example of how to think about your testing and how to build up useful tooling that caters to your work. Now go write some bash and make sure you test it; trust me, orchestrating a call to &lt;code&gt;git&lt;/code&gt; is ten times easier than screwing around with some git integration for your language of choice. These tools were meant to work together in the shell, and you will be happier just getting things done. Double happy when you can prove it works with a test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Errata
&lt;/h2&gt;

&lt;h3&gt;
  
  
  sh is not bash and vice versa
&lt;/h3&gt;

&lt;p&gt;While not a functional error, the title of this piece should have focused on bash, since a lot of the sample code relies on bash-isms, especially &lt;em&gt;exported functions&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  the sh alias and CI
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;run sh ./.tests/temp.sh&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;sh&lt;/code&gt; is often an alias on modern systems, and this can have a huge impact when your scripts run in CI, or more precisely in a non-interactive, non-login session. While your workstation might provide &lt;code&gt;bash&lt;/code&gt; as the implementation behind &lt;code&gt;sh&lt;/code&gt;, the Ubuntu or Alpine Linux image your CI offers may use a lighter-weight implementation like &lt;code&gt;dash&lt;/code&gt; when running your tests. Because we are using features that are explicitly bash, our test suite should &lt;code&gt;run bash ./.tests/temp.sh&lt;/code&gt;, and I have altered the example above accordingly.&lt;/p&gt;
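&lt;p&gt;One hedged way to catch this early, sketched here as an assumption rather than anything from the original suite, is a guard at the top of the test file that fails fast when the interpreter is not actually bash:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# A small guard for the top of a suite: BASH_VERSION is set only when bash is
# the interpreter, so dash (or another sh stand-in) trips this immediately.
if [ -z "$BASH_VERSION" ]; then
  echo "this suite requires bash, not a plain sh implementation"
  exit 1
fi
echo "running under bash $BASH_VERSION"
```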

</description>
      <category>bash</category>
      <category>testing</category>
      <category>shell</category>
      <category>tdd</category>
    </item>
    <item>
      <title>Is Pragmatism a Dogma in Software?</title>
      <dc:creator>Paul Scarrone</dc:creator>
      <pubDate>Sat, 08 Jan 2022 18:14:46 +0000</pubDate>
      <link>https://dev.to/paulscoder/is-pragmatism-a-dogma-in-software-d1c</link>
      <guid>https://dev.to/paulscoder/is-pragmatism-a-dogma-in-software-d1c</guid>
      <description>&lt;p&gt;Considering the influence of the seminal text, The Pragmatic Programmer, I often consider the intention of the lessons within and the culture, dare I say dogma of pragmatism in the software scene.&lt;/p&gt;

&lt;p&gt;If we consider that Pragmatism, loosely defined, asserts that objective reality is best viewed in terms of its practical uses and successes (&lt;a href="https://en.wikipedia.org/wiki/Pragmatism" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;), then specifically, in relation to software development, we as developers are responsible for making features “work” in a pluralistic, problems-oriented environment. Let's explore the complications of this implied pluralism.&lt;/p&gt;

&lt;p&gt;It might be easy to view the nature of pragmatism in software as purely competitive, but development is a considerably flexible act where conflicts can be moderated by a change of perspective. Nonetheless, here is a non-exhaustive list for consideration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quality, Cost, Time Model&lt;/li&gt;
&lt;li&gt;Product Accuracy / Solution Completeness&lt;/li&gt;
&lt;li&gt;Risk&lt;/li&gt;
&lt;li&gt;Entropy&lt;/li&gt;
&lt;li&gt;Communication&lt;/li&gt;
&lt;li&gt;Prototyping&lt;/li&gt;
&lt;li&gt;Design Orthogonality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might be catching on to the fundamental complexity here: the path of pragmatism an individual IC (Individual Contributor) walks can be biased. Whether that bias comes from opinion, wisdom, or knowledge, or is directed by leadership and culture, pragmatism is too broad to observe in full.&lt;/p&gt;

&lt;p&gt;So, here is where we encroach on dogma and its toxic impact on software development. Pragmatism is a complex topic that is often “simply” explained. As with any sufficiently complex concept, practice and study are our best defense against erroneous opinions. The risk we run is not having a holistic approach to identifying and repeating pragmatism.&lt;/p&gt;

&lt;p&gt;So, I figured this was a good opportunity to share something I found while researching a larger body of work: &lt;a href="https://github.com/HugoMatilla/The-Pragmatic-Programmer" rel="noopener noreferrer"&gt;https://github.com/HugoMatilla/The-Pragmatic-Programmer&lt;/a&gt;. Within it is a lovely abridgment of the lessons from the book. Now let's compare and contrast the philosophy and the reality of pragmatism.&lt;/p&gt;

&lt;p&gt;We often approach conflicts like these, the fundamental application of pragmatism, by exploring case studies. But since pragmatism is, ideologically, a philosophy, we should try our hand at building a tool set to better understand how we consume it. That way we can better identify whether we are following the practice or the ideal.&lt;/p&gt;

&lt;p&gt;Some questions I ask myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Am I sufficiently confident that the problem is well-defined?&lt;/li&gt;
&lt;li&gt;Am I chasing the problem as opposed to fixing it?&lt;/li&gt;
&lt;li&gt;Do I need to take into account violations of the “Law of Demeter” (&lt;a href="https://github.com/HugoMatilla/The-Pragmatic-Programmer#the-law-of-demeter-for-functions" rel="noopener noreferrer"&gt;Prag&lt;/a&gt;)?&lt;/li&gt;
&lt;li&gt;Have I reviewed the current solution before I started (Maintenance/Enhancement)?&lt;/li&gt;
&lt;li&gt;Do I understand the contracts in play (&lt;a href="https://github.com/HugoMatilla/The-Pragmatic-Programmer#26-decoupling-and-the-law-of-demeter" rel="noopener noreferrer"&gt;Prag&lt;/a&gt;)?&lt;/li&gt;
&lt;li&gt;Can I produce assertions for the change to be produced?&lt;/li&gt;
&lt;li&gt;Does this make me want to introduce a refactor pass?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's dig in a little further and try to extrapolate the pragmatic outcomes from these questions.&lt;/p&gt;

&lt;p&gt;The first two address reactivity and the competency to start working. It's a measure-twice, write-once mentality. Pragmatism as a dogma of small and simple often pushes for action as opposed to developing intention and a plan. Problems, frequently, are not isolated; they are part of a system that takes care to understand. Some context clues here: the frequency of code change in the target domain and the perceived complexity of the code.&lt;/p&gt;

&lt;p&gt;If the code seems to need a lot of collaborators to complete single actions, alarms should be raised that you need to be prepared for more effort. Some context clues are: the combination of delegation, singletons, and inheritance within the same module, or numerous functions that share the same or similar names with variable arity.&lt;/p&gt;

&lt;p&gt;The rest are about training myself not to commit to a delivery until I have reviewed the existing solution. Preferably I would approach problems using TDD (Test-Driven Development), but let's be honest: we will be lucky to know enough about the solution to write blind tests for a sufficiently complex system. This approaches a completely different topic, but tests contain assertions, and if we use the provided text for guidance here, assertions are the secret sauce. The tricky part is that assertions in practice produce an exceptional state when violated. If we choose to handle these states or use guard patterns, we must ensure they are not introducing or hiding side effects.&lt;/p&gt;

&lt;p&gt;Software developers have many roles, but the primary one will always be change management, and here is where I come to my last observation: refactoring. Like all things “agile”, refactoring should be done as progressive changes that reduce risk, or omitted entirely. The choice of whether and when to refactor is the most powerful inherent benefit of pragmatic decision-making. All refactors should be considered non-trivial actions unless a suitably concrete case can be made that they would produce a positive constant multiplier of value. It's often better to document ugly code, learn from its failures, and leave it be than to seek the idealistic justice of a refactor. The simple litmus test: if you don't feel you could go to your manager and propose the refactor's benefit to the business, you probably don't need it.&lt;/p&gt;

&lt;p&gt;If there is something here you disagree with, that's probably a good thing. Our source material, first published in 1999, like any influential text, has to be constantly measured for validity against our current lives and careers. If we do not, we risk allowing the ideal of pragmatism to evolve into a dogma stripped of its fundamental meaning.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>pragmatism</category>
      <category>craft</category>
      <category>books</category>
    </item>
  </channel>
</rss>
