TL;DR
- I am running an experiment where I forked Forem, the OSS behind dev.to, and used AI agents to extract "every single behavior" as Markdown specification documents
- Agents autonomously analyzed file dependencies spanning Rails Controllers, React Components, and Workers, and successfully constructed a network of specifications with bidirectional links (ULIDs)
- I am building a new SDD (Spec-Driven Development) toolkit called `specre` to make this approach possible
A question for you, the reader
- In a large codebase like Forem, can you immediately grasp where a specific feature (e.g., inviting a member to an Organization) has an impact?
- Can an AI understand that efficiently and accurately?
- Have you ever been frustrated when an AI failed to grasp context that you thought was perfectly clear?
yoshiakist / specre

Atomic, living specification cards for AI-agent-friendly development. Minimal, agnostic, and traceable.
specre ( /spékré/ ) is a minimal specification format and toolkit for Spec-Driven Development (SDD). Each specre is a single Markdown file describing exactly one behavior, with machine-readable front-matter for lifecycle tracking and agent navigation.
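To make this concrete, here is a sketch of what a specre card might look like. The ULID id and the `status` and `last_verified` front-matter fields are the ones this article describes; the remaining field names and section headings are illustrative assumptions, so consult the README for the actual format.

```markdown
---
specre: 01HZXK3V9Q8F2M4N6P8R0T2V4X   # ULID, also embedded as a @specre comment in code
status: draft                        # lifecycle, e.g. draft -> stable
last_verified: 2025-01-15            # used to detect stale specs
---

# user_can_signup_with_email

## Intent
New visitors can create an account with only an email address, keeping signup friction minimal.

## Scenarios
- A visitor submits a valid email and receives a confirmation link.

## Failure patterns
- An already-registered email is rejected with a clear error message.
```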
The Problem
Specifications are essential for keeping development intent visible and traceable. But in practice, they rot:
- Specs drift from code in silence. No one notices when an implementation diverges from its specification — until the next developer (or AI agent) builds on stale assumptions.
- Monolithic specs waste AI context. Large specification documents force agents to parse entire features just to understand a single behavior, consuming the finite context window that should be reserved for code and tests.
- Small changes never get specced. When the cost of writing a specification is high, only greenfield features get documented. Bug fixes, refactors, and incremental changes slip…
What was actually built
I'll go into detail later in this article, but first — just take a look at these images!
All these behavior specifications were generated by a finely-tuned workflow and a set of scripts.

Related files, overview, intent, key members, scenarios, and failure patterns can be provided instantly as context for a coding agent.

Each file has a "specre tag" built around a ULID, which creates bidirectional links between code and specification(s) and weaves them into a network.

1. Why I built specre
The limits of existing specifications and TDD
Specifications always rot. SDD toolkits have become more widely used, but even when spec Markdown files are a first-class part of the pipeline, those specs begin to drift from reality the moment they are created.
It's a natural thought that if we reuse code, we should reuse specifications too. But in practice, the harsh truth is that solving this problem with natural language alone is extremely difficult.
I always write code test-first (and I imagine most of you do the same at work). I do the same when having AI write code, and I enjoy designing workflows that way. At the same time, however, I'm starting to feel the limits of TDD in this era of Vibe Coding.
You can prove the correctness of behavior to a machine, but for humans — or for PMs and QA — it's not intuitively clear from the current codebase "what value is being delivered to users." Not without clicking through screen after screen, again and again.
Why is it important? Why is it necessary? Why won't a different approach work? These fragments of an engineer's thinking sometimes survive as code comments, but there is no way for an AI to grasp a coherent "intent" that spans multiple files.
The context window limits of AI agents
Providing value through a set of features requires writing massive specification documents, but that wastes the LLM's context window unnecessarily. When an agent (or a subagent receiving instructions) only cares about "one specific behavior," a workflow that repeatedly runs grep across the entire codebase and reads entire files that are likely irrelevant is extremely inefficient.
Furthermore, the more context you feed an AI agent, the more diluted its meaning becomes. If important information appears in the middle, it becomes easier to forget, and the likelihood that it will be taken into account in the final output keeps dropping.
The specre philosophy: "One Markdown, one behavior"
Here I'll briefly describe what makes specre distinctive. For full details, please check out the README.
- Bidirectional traceability between source code (`@specre` comments) and Markdown specifications using ULIDs reduces to zero the reasoning cost for an agent that wants to know "what is the intent of this code" or "what code implements this intent"
- Lifecycle management via `status` and `last_verified` fields in front-matter, enabling detection of stale specs and drift between spec and reality
- A fast CLI written in Rust. In particular, `search` combined with a project-specific vocabulary glossary efficiently delivers a small amount of targeted information plus hints for what to search next, rather than an overwhelming flood of results
- An MCP server makes specre commands the first-choice tool for coding agents during planning and initial exploration
2. Applying specre to Forem
Why Forem?
Forem — the codebase behind dev.to — is a practical, large-scale codebase that developers all over the world know. With over 3,000 Ruby files and 700+ JS/JSX files, it felt like exactly the right size to explore specre's practicality and applicability.
And honestly, I've always had a small dream of posting something on dev.to as a developer. (I'm genuinely excited right now!)
How the extraction works
Simple RAG or off-the-shelf AST tools simply cannot capture "intent." To generate behavior specifications all at once, I built a four-phase multi-agent workflow.
Phase 1–2 (The Brain — Claude 4.6 Opus):
To save tokens, the agent reads only the AST (Abstract Syntax Tree) structural map, not the raw source code.
A custom script is used to extract structure only. If a domain is too large, this AST extraction script autonomously proposes splitting it into sub-domains to prevent token explosion.
In Phase 1, related files and dependencies in the codebase are identified by domain name, and files deemed especially strongly related are organized.
Prompt to the agent:
/specre-generate feed domain, pls!
Agent begins discovery:
python3 .claude/commands/scripts/domain-discovery.py feeds --json --root . 2>&1
The script's output:
Domain discovery: feeds
Seed class names: FeedConfig, FeedEvent, FeedbackMessage, Feeds, ...
Total: 290 files (258 untagged, 32 tagged)
Output split into 3 parts (>100 files):
/tmp/specre-discovery-feeds-part1.json
/tmp/specre-discovery-feeds-part2.json
/tmp/specre-discovery-feeds-part3.json
One of the output JSON files:
{
"domain": "feeds",
"seed_class_names": ["FeedConfig", "FeedEvent", "FeedbackMessage", "Feeds", ...],
"stats": {
"stage1": 108, "stage2": 140, "stage3": 42, "total": 290,
"untagged": 258, "tagged": 32
},
"part": 1,
"total_parts": 3,
"files": {
"app/assets/javascripts/lib/xss.js": {
"stage": 2,
"reason": "import: from app/javascript/articles/__tests__/Feed.test.jsx",
"specre_tags": []
},
"app/assets/javascripts/utilities/timeAgo.js": {
"stage": 2,
"reason": "import: from app/javascript/articles/__tests__/Feed.test.jsx",
"specre_tags": []
},
...
In Phase 2, another script optimized for Ruby and JS is run against specific files identified in Phase 1, extracting structure that includes method names and what each method returns.
From this reasonably reliable structural information, the agent infers "cross-layer behaviors" spanning Rails Controllers through React components and Workers, and designs a behavior catalog.
This catalog contains only the specification names (e.g., user_can_signup_with_email.md) and the files related to each one.
At this stage, naming, classification, and granularity are critically important. The workflow includes a self-review step where the agent checks whether the specification naming accurately captures the behavior and value, and whether anything has been overlooked.
The single most important rule throughout Phases 1 and 2 is that reading the actual files is strictly forbidden. Let Opus — smart as it is — focus exclusively on file paths, dependencies, method names, and return values.
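The structure-only extraction used in Phases 1 and 2 can be sketched with Python's built-in `ast` module. The real scripts target Ruby and JS/JSX and are more elaborate; this is only an illustrative analogue of the idea of surfacing method names and what each method returns, never the bodies.

```python
import ast


def extract_structure(source: str) -> dict[str, list[str]]:
    """Collect {function_name: [returned expressions]} from source.

    A toy analogue of structure-only extraction: downstream agents see
    names and return expressions, never the full implementation.
    """
    tree = ast.parse(source)
    structure: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            structure[node.name] = [
                ast.unparse(ret.value)
                for ret in ast.walk(node)
                if isinstance(ret, ast.Return) and ret.value is not None
            ]
    return structure


sample = """
def feed_for(user):
    articles = recent_articles(user)
    return rank(articles)
"""
print(extract_structure(sample))  # {'feed_for': ['rank(articles)']}
```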
Phase 3 (The Workers — Claude 4.6 Sonnet):
Based on the catalog Opus produced, multiple Sonnet subagents are launched in parallel.
This is the first point at which actual code is read and natural-language scenarios are written.
At this stage, the status in front-matter is kept as draft.
Only if a test is judged to sufficiently cover the behavior is the status upgraded to stable.
Additionally, @specre <ULID> tags are automatically and rapidly embedded into the source code via MCP server calls to the specre command. With these tags in place, when implementing or fixing a feature, an agent can run specre trace via MCP to instantly cross-reference spec and source in both directions.
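The draft-to-stable promotion rule above can be sketched as a small function. The judgment of whether a test "sufficiently covers" a behavior is made by the agent itself, so the list of covering tests here is an assumed input rather than part of any real specre API.

```python
def promote_status(card: dict, covering_tests: list[str]) -> dict:
    """Promote a draft card to stable only when at least one test is
    judged to cover the behavior; otherwise it stays draft."""
    if card.get("status") == "draft" and covering_tests:
        return {**card, "status": "stable"}
    return card


print(promote_status({"status": "draft"}, ["spec/models/article_spec.rb"]))  # {'status': 'stable'}
print(promote_status({"status": "draft"}, []))  # {'status': 'draft'}
```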
Examples of actual output
The output is basically what you see in the images above, but for more detail please refer to the links below. These point to my fork of Forem.
- List of specifications in the organizations domain
- Behavior Markdown for the spec "Author can upload an image"
- The workflow that generates these specre cards
Note 1: Even if you are using an environment that does not support subagents outside of Claude Code, the workflow should be usable with minor adjustments to the prompt wording.
Note 2: Parser scripts for languages other than Ruby/JS/JSX have not yet been created. See "Challenges" below for details.
3. The power of a network woven from code and specifications
What has actually happened now that specre has been applied to Forem?
Please watch the following videos.
Demo of searching for a specification card in natural language:
Demo of tracing code intent by natural language:
A revolution for coding agents
The reasoning cost of exploration drops to nearly zero. When asking an agent to fix a bug, the agent follows a path like this:
- Human: "Change the spec where XXX does AAA so that it does BBB instead."
- Agent: runs `specre search "xxx aaa"`
- Agent: "Got 8 specre cards. One likely describes this specification. Let me look at it."
- Agent: "I fully understand the specification and intent. Now let me investigate the related files..."
From there, 5–6 reads are enough to form a rough plan for fixing the feature.
In my experience, when an existing spec is available, the total number of commands the agent issues before forming a fix plan is around 10–15.
What makes this groundbreaking? A coding agent normally moves in the order: broad exploration → grasp behavior and spec → grasp intent. With specre, the order becomes: grasp intent and behavior simultaneously → read related files to "confirm." That reversal dramatically reduces the agent's reasoning cost.
The importance of a deterministic approach
Another important point is that these processes are deterministic rather than probabilistic. If the related files for an atomic specification have been verified, then exploration to second- and third-order nodes is also deterministic. It's on the specre roadmap, but the Rust CLI will be able to instantly answer questions like "what is the potential blast radius of this change?"
Existing AI agents, when fixing bugs, gather likely-related files through grep or vector search — probabilistically, by guess — and as a result they sometimes break unrelated files or miss critical dependencies. With specre's ULID tags, however, an agent can deterministically identify the scope of impact of any change.
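The deterministic traversal described above can be sketched as a breadth-first walk over the tag-derived link graph. The graph shape and node naming here are hypothetical illustrations; the actual CLI is written in Rust, and the blast-radius query is still on the roadmap.

```python
from collections import deque


def blast_radius(graph: dict[str, set[str]], start: str, depth: int = 2) -> set[str]:
    """Nodes reachable within `depth` hops of `start` in an undirected
    spec<->file link graph: a deterministic impact set, no guessing."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for neighbor in graph.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, d + 1))
    return seen - {start}


# Undirected links derived from @specre tags (hypothetical example).
graph = {
    "spec:invite_member": {"app/models/organization.rb", "spec:send_invite_email"},
    "app/models/organization.rb": {"spec:invite_member"},
    "spec:send_invite_email": {"spec:invite_member", "app/workers/invite_worker.rb"},
    "app/workers/invite_worker.rb": {"spec:send_invite_email"},
}
print(sorted(blast_radius(graph, "spec:invite_member")))
# ['app/models/organization.rb', 'app/workers/invite_worker.rb', 'spec:send_invite_email']
```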
Value for humans (developers, PMs, QA)
Reducing onboarding cost
With specre, even someone completely new to using or contributing to Forem can immediately understand what features the system provides and which files are involved in each behavior. For example, if you wonder "what is a broadcast, and how is it created?", running specre search "create broadcast" will surface the relevant spec Markdown instantly. If you want to know "what features involve email?", looking at the email directory under specres makes it immediately clear from the listing.
specre minimizes the onboarding cost for new contributors.
Reclaiming control over complex systems
When modifying a feature, you can edit the existing specre Markdown as a requirements definition and hand it to the agent — a consistent implementation patch comes out. The same applies when adding variations to existing behavior.
Of course, adding an entirely new behavior (i.e., a new feature) is straightforward too. Start by writing the intent and a rough sketch of the scenarios. The agent will then refine it into a polished spec following the specre format, complete with feature overview, intent, scenarios, and failure/edge cases. All you need to do is review and approve.
Once you try it, you'll notice that even standard Vibe Coding steps feel more structured and grounded. This gives you a stronger sense of control, even when implementing against a complex domain.
Lately, there's a narrative around Vibe Coding suggesting that the human's role is only to inject "intent" — that humans should let go of micro-level control. For software at a certain scale, that's probably true.
But what about complex software where the behavior of one domain affects multiple other domains or multiple microservices? How confident are you that a coding agent won't introduce a single bug just because you told it the intent? For example, if you ask an AI to "fix the feature that changes an article's cover image" in Forem, are you comfortable fully delegating to the AI whether that change safely propagates to the mobile app API (ForemMobile) and CDN cache invalidation?
With specre, humans only need to review the "scenarios" and "failure pattern definitions" in the Markdown to reclaim control over whether the AI has missed any edge cases.
Personally, I don't think we should yet abandon the human intuition about cross-domain side effects.
As a foundation for cross-functional collaboration
For non-engineers such as PMs and QA, these natural-language scenarios also become a hub for accurately understanding system behavior. By understanding behavior in more detail than the vague layer of "intent," the gap between expectations and actual deliverables can be minimized.
This also delivers a particularly strong ROI for QA. For many QA engineers, analyzing unit tests offers limited value. Even with a script that extracts `it` and `describe` blocks for readability, test cases are often too simple, or the `it`/`describe` descriptions are omitted when the intent is obvious from the test code itself.
If you want to use them to explore edge cases, what you need is not unit tests as a starting point, but documentation that describes the expected behavior of the system in scenario-based, natural language. With specre, the current state of the code should be something the QA team can actually reference.
Drift detection
I'll keep this brief since it's something I'm actively working on in the current roadmap: specre enables detection of stale specifications.
Code is a living thing, and when developing in a team, it's inevitable that spec updates get forgotten. By having an agent patrol on a CI or heartbeat schedule, we can detect and report discrepancies between specs and actual code, ordered by oldest last_verified date.
When human users are modifying a feature or editing related files, they'll also naturally notice that the spec itself is outdated — or it will serve as a clue when trying to figure out whether the spec or the code is wrong.
The specre philosophy that specifications should have a lifecycle largely originates from this perspective.
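The patrol ordering described above (oldest `last_verified` first) can be sketched like this; the front-matter shape is assumed, and the real implementation is still on the roadmap.

```python
from datetime import date


def stale_specs(cards: list[dict], max_age_days: int, today: date) -> list[dict]:
    """Return cards whose last_verified is older than max_age_days,
    oldest first: candidates for a drift-check patrol on CI."""
    stale = [
        c for c in cards
        if (today - date.fromisoformat(c["last_verified"])).days > max_age_days
    ]
    return sorted(stale, key=lambda c: c["last_verified"])


cards = [
    {"path": "specres/feeds/user_sees_feed.md", "last_verified": "2025-09-01"},
    {"path": "specres/email/digest_is_sent.md", "last_verified": "2025-03-10"},
]
result = stale_specs(cards, max_age_days=90, today=date(2025, 10, 1))
print([c["path"] for c in result])  # ['specres/email/digest_is_sent.md']
```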
4. Challenges and dilemmas
I've been painting quite a rosy picture so far, but of course this is not magic. Let me share the real-world hurdles too.
The adoption cost paradox
Token cost
One of specre's goals is eco-friendly AI development — being kind to both your wallet and the planet by saving tokens. However, I also discovered that the initial bootstrapping requires enormous computational resources. Even with Claude Max ($100), it took 5 full days of manual orchestration to process all 40+ domains.
My hypothesis for a solution: if you have access to a higher API plan (such as Claude Max $200), processing all the project's domains in one batch might yield dramatically better cost efficiency in the end.
With Opus 4.6 + 3-parallel Sonnet 4.6 subagents, roughly 20 minutes of inference per domain was needed, depending on domain size.
Whether that feels like a reasonable cost or an enormous time investment — what do you think?
When adopting specre into an existing project, what specre is fundamentally doing is "pre-paying the search cost."
But the larger the team and the longer the project, the more powerfully this pays off. After all, the network of code and specifications lives in the repository itself — one person pays the cost, and everyone else receives the benefit.
Human effort cost too
I also had to wrestle quite a bit with decomposing Forem's internal structure and designing domain granularity, since I was seeing it for the first time. Ideally, the person who knows the codebase best (who is usually also the busiest person on the team) would need to oversee this entire workflow.
For example, in Forem, tags and liquid_tags are qualitatively entirely different domains with completely different concerns. But I didn't notice that at first, ran the specre generation workflow for tags as a whole, realized something was off, and eventually had to start over. You should start from the smallest, most specific domains (in this case liquid_tags) and work outward toward larger, more comprehensive ones — but this depends entirely on human knowledge of the system's characteristics.
In my experience, I don't currently believe that an AI can autonomously reason about and correctly design this kind of effective execution order.
Note: Partway through, I also experimented with an exploration approach using AST and vector indexes built with cocoindex, instead of my custom scripts. If you want to support all programming languages and frameworks, that approach may have better fundamentals.
Both workflows are documented in the forked repository.
Human review is still essential
Opus's reasoning ability is extraordinary, but the step where a maintainer (a human) with domain knowledge verifies "is this specification actually correct?" cannot be skipped. The specre adoption guide also strongly recommends that humans review this step.
That said, reading through every single specification across all domains from scratch during bootstrapping is genuinely hard work. Most people's gut reaction would be to refuse outright.
Hey Forem maintainers (if you are reading this!): After reading a few of the specifications I generated in my environment, do they seem reasonable to you? Since I can't review them with any real authority, there's a very real possibility that these specs — wrung out of Claude Code through sheer persistence — are nothing but a pile of garbage.
(And if that's the case, please don't hesitate to say so. If I need to fundamentally rethink some aspect of the approach or methodology, I want to know.)
AI-generated specifications are inherently probabilistic
In generating specs for every behavior in Forem, I relied heavily on AI. This means that even though I claim "verified specres enable deterministic traversal of the network," the network itself was generated probabilistically.
Did the AI truly describe every behavior? Did it list all related files — comprehensively and without redundancy? The truth is unknowable without verification by a core maintainer.
For example, Forem has a broadcast feature (a user-facing announcement displayed near the nav bar), so I instructed Claude to cover that domain with specre. The output spec titles were:
- admin_can_create_broadcast
- admin_can_view_broadcast
- admin_can_list_broadcast
- admin_can_edit_broadcast
- admin_can_delete_broadcast
So I asked: "Okay. But who actually sees a broadcast, and where? Who benefits from a broadcast existing?" Claude replied: "I forgot to include the behavior where a user views it — shall I add that?"
...And yet, there's still some reason for hope.
Even I — having only just signed up on dev.to two weeks ago and tentatively poking around Forem for the first time — could catch that. There's no way a maintainer wouldn't.
And the specre generation workflow is still evolving. There's still plenty to be done: building verification chains, adding constraint-check flows delegated to separate subagents, designing dedicated commands for validation, and more.
5. Closing thoughts and a question for the community
We're still very much in the middle of this journey, but I genuinely feel that this approach has the potential to fundamentally change the future of SDD (Spec-Driven Development).
I myself develop indie games using Godot Engine and use specre for that work. So even if nobody else in the world uses it, I'll probably keep developing specre for my own benefit for a while.
That said, what might this look like in five years? A model that can internalize a multi-gigabyte project in an instant and fully grasp its behavior as though it were one giant function — something like that wouldn't surprise me too much if it emerged. But until that day comes, I believe that atomic specification documents can serve as a reliable guide for engineers and product teams navigating complexity.
Discussion
- If your team's project had a "network of specifications linked to code" like this, would that be valuable? Would it make collaboration with AI easier?
- Could it work as a communication format for specifications with non-engineers such as PMs or QA?
- `specre` is a brand-new project. If this approach resonates with you, please check out the GitHub repo and leave a star!
A note on AI use in writing this article
This article was written by me in Japanese, then translated into English with the assistance of AI.