It is time to put the proposition from my previous blog post to the test. Is it possible to spec an application for execution by an agent without encoding it in source code? Let's find out.
One type of application every knowledge worker is familiar with is task management. Every task has a lifecycle status, dependencies on other tasks, and a history of progress.
Let's give agents their own.
TaskTrack is a simple but non-trivial task management system implemented as a Specify spec. It goes beyond the checkbox-based to-do lists that agents sometimes use internally and mimics the key features listed above: lifecycle status, dependencies, and progress history.
TaskTrack defines two procedures: a "Plan Authoring Run" to create an interconnected set of tasks from requirements and a "Plan Execution Run" to advance a previously authored plan toward completion. One execution run might not always be enough to achieve completion, because TaskTrack allows requesting human feedback and incorporating it during the next execution run. Furthermore, every execution run is divided into "Task Processing Run" sub-procedures to allow for advanced agent context management.
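For readers who think in code, here is a rough Python sketch of the kind of data such a system tracks. It is not how TaskTrack works internally (TaskTrack is prose, not source). The status names, the field names, and the `ready_tasks` helper are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical lifecycle states; the actual spec may name them differently.
    OPEN = "open"
    AWAITING_FEEDBACK = "awaiting_feedback"
    DONE = "done"


@dataclass
class Task:
    # Hypothetical fields: an ID, a lifecycle status, dependencies, and a history log.
    task_id: str
    status: Status = Status.OPEN
    depends_on: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)


def ready_tasks(plan: dict[str, Task]) -> list[str]:
    """Return the IDs of open tasks whose dependencies are all done."""
    return [
        task.task_id
        for task in plan.values()
        if task.status is Status.OPEN
        and all(plan[dep].status is Status.DONE for dep in task.depends_on)
    ]


plan = {
    "T1": Task("T1", status=Status.DONE),
    "T2": Task("T2", depends_on=["T1"]),
    "T3": Task("T3", depends_on=["T2"]),
}
print(ready_tasks(plan))  # ['T2']
```

In this toy plan, only `T2` is ready: `T1` is already done, and `T3` still waits on `T2`. An execution run would, roughly speaking, keep picking ready tasks until none remain or human feedback is needed.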
TaskTrack implements all of this in fewer than 300 lines of text. If the implementation used source code, then, depending on the programming language, that would be enough space for only the required file I/O operations (TaskTrack uses files for simplicity, not a database). Natural language can easily become bloated, but a stringent, scientific writing style and extensive use of what the Specify standard offers can effectively counter that.
The official test is, how could it be any other way, the implementation of yet another uninspired Breakout clone. The requirements, the completed TaskTrack plan, and the deliverable are contained in the repository.
If you want to run the test yourself, the included README file contains the necessary information, including the launch prompts for both the authoring agent and the execution agent. Please note how both launch prompts are structured. They use TaskTrack terminology and point to the relevant files. They do not contain task-related behavioral instructions. The execution agent launch prompt contains agent-specific instructions for mapping agent features to the generic TaskTrack specification. The principles behind the good old manual coding design patterns remain valid even in the agentic era!
And now, finally, for the test result. In a nutshell: It works!
The authoring agent created all TaskTrack files as indicated, which is, maybe, less surprising or impressive. More importantly, the execution agent showed deterministic behavior over all 16 tasks and two execution runs. I often hear that deterministic behavior must remain encoded in source due to the inherently random, and hence non-deterministic, nature of LLMs. I cannot confirm this based on the test result. The execution agent followed the step-by-step procedure definition by the book each and every time. Even the defined textual output was created as reliably and repeatably as if it were produced by a print statement.
It goes without saying that this single test result does not deliver a general proof of the viability of speccing. It shows it can work; it is possible. Maybe non-deterministic agent behavior is more often than not the result of unspecific instructions rather than randomness in the underlying LLM.
Having said all this, the test run was far from perfect. It produced several so-called valuable learning experiences.
The first and most obvious finding is that all but one of the timestamps are incorrect. The authoring agent wrote and executed a Python script to retrieve the current UTC time. All task processing subagents simply invented timestamps. When I later asked the system about this difference in behavior, it gave an interesting answer: Creating a new timestamp is a "single, salient, one-off step... worth a real python/date call." Updating timestamp fields, by contrast, is "a repeated, mechanical step... every task, every run, in fresh subagent contexts," and "models systematically deprioritize repeated boilerplate."
This is not a TaskTrack issue but rather the result of an ill-equipped agent. And at this point, no matter how hard I try, I cannot stop myself from remarking, tongue in cheek, that the machines feared to first fire and then nuke us apparently have no built-in access to the current time... I will keep this in mind, just in case.
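For reference, retrieving the current UTC time takes only a few lines. The following is a minimal sketch of what such a script could look like. The function name `utc_timestamp` and the ISO 8601 output format are my assumptions for illustration; they are not taken from the TaskTrack specification or from the script the authoring agent actually wrote.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string with second precision."""
    # Assumed format: YYYY-MM-DDTHH:MM:SSZ (the spec may require something else).
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return now.isoformat().replace("+00:00", "Z")


if __name__ == "__main__":
    print(utc_timestamp())
```

A subagent with shell access could run something like this once per timestamp update instead of guessing.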
The second finding is that, as the agent itself remarked when reviewing the test results, task resolutions are not necessarily as brief as mandated by the TaskTrack specification. But then, what is brief? Precisely. This is the kind of hastily written, hand-wavy instruction that is open to interpretation and leads to varying results. Just because we are using natural language now does not mean we are allowed to let our rigor slip.
Luckily, it is not a major pain point, since it only affects the resolution, not the core processing logic. Still, it is worth fixing in a future publication.
The third finding is that, strictly speaking, the test run was flawed because these wonderful machines now have memory. Both the authoring and execution agents revealed in their thinking output that they were aware that this was a test. I do not think this flaw invalidates the qualitative test result as such. Still, future test setups will require more care and consideration.
In the meantime, the TaskTrack specification is live, the license is permissive, and the floor is open. Have a look around, and let me know what you think.