DEV Community: Rocking Eval

I want to build a community and bring developers to my great OSS project. How can I do it? I am able to chat or mentor any individual who wants to contribute. Will it help? How do I even find people interested in doing that?

Rocking Eval — Wed, 24 Jun 2026 05:41:41 +0000

Coding Agents: Moving From "Bash Mimics" to "AST Manipulators"

Rocking Eval — Sun, 21 Jun 2026 11:27:11 +0000

In the last post, we killed the "Tool Abstraction" layer by replacing 50 brittle JSON-RPC wrappers with a single eeva process (Elixir on the BEAM).

Here is how we moved our agent from text-based hacking to actual AST-aware refactoring.

The Problem: The "Diff" Illusion

Standard agents "edit" files by outputting search blocks:

Plaintext
<<<< SEARCH
def hello_world, do: "hi"
==== REPLACE
def hello_world, do: "hello elixir"
>>>>

This is fundamentally broken. It relies on the model perfectly hallucinating the exact state of the file, including indentation, hidden newlines, and context. If the file has changed by one space since the last read, the entire operation fails. It is brittle, state-blind, and expensive.

Coding agents try to work around this with increasingly complicated multi-edit algorithms. Every time the model changes a file, it has to read the file again before the next edit. If that were the standard flow, getting it right would cost a huge number of tokens. So harnesses add a few ninja tricks: they keep the last edits in memory and shift the positions of the edits that follow. But the model doesn't know about any of this. Eventually, the simple operation of editing a file becomes a guessing game for both the model and the harness.
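
To make the failure mode concrete, here is a minimal sketch of an exact-match search/replace applier. `SearchReplace.apply_block/3` is a hypothetical helper written for this illustration, not code from any particular harness; real harnesses differ in the details, but an exact string match behaves the same way.

```elixir
defmodule SearchReplace do
  # Hypothetical helper for illustration; not taken from any specific harness.
  # Applies a SEARCH/REPLACE block only if the search text matches exactly.
  def apply_block(source, search, replace) do
    if String.contains?(source, search) do
      {:ok, String.replace(source, search, replace, global: false)}
    else
      {:error, :search_block_not_found}
    end
  end
end

search = ~s(def hello_world, do: "hi")
replace = ~s(def hello_world, do: "hello elixir")

# The file as the model last read it.
fresh = "defmodule Greeter do\n  def hello_world, do: \"hi\"\nend\n"

# The same file after something inserted one extra space after the comma.
drifted = "defmodule Greeter do\n  def hello_world,  do: \"hi\"\nend\n"

# Matches the fresh copy, fails on the drifted one.
{:ok, _edited} = SearchReplace.apply_block(fresh, search, replace)
{:error, :search_block_not_found} = SearchReplace.apply_block(drifted, search, replace)
```

One extra space is enough to turn a valid edit into a rejected one, and the model only finds out after the round trip.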

The Solution: Pure Functional Piping

We didn't solve this by introducing custom "edit primitives" or specialized editing tools. Writing unique tool APIs just forces you back into prompt engineering hell.

Instead, we give the model full Elixir execution to edit files directly. Because the code is the tool, the model can execute a complete chain of multi-edits across multiple files in a single, atomic operation using standard language features.

The model doesn't output raw diff blocks; it pipes the files through pure functional transformations.

elixir
config_path = "config/config.exs"

File.read!(config_path)
|> String.replace(":old_port, 4000", ":new_port, 8080")
|> then(&File.write!(config_path, &1))

"lib/workspace/"
|> File.ls!()
|> Enum.filter(&String.ends_with?(&1, "_worker.ex"))
|> Enum.each(fn file ->
  path = "lib/workspace/#{file}"
  File.read!(path)
  |> String.replace("get_port(:old_port)", "get_port(:new_port)")
  |> then(&File.write!(path, &1))
end)
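
One caveat: a sequence of `File.write!` calls is not atomic by itself. If the third file fails, the first two are already changed. A sketch of how the model could get closer to all-or-nothing behaviour is to stage every transformation in memory first and write only when all of them succeeded. `StagedEdit` and the temp-directory file names below are hypothetical, made up for this example, and not part of eeva.

```elixir
defmodule StagedEdit do
  # Hypothetical helper; names are illustrative and not part of eeva.
  # Each edit is {path, fun}, where fun maps old file content to new content.
  def run(edits) do
    # Phase 1: read and transform everything in memory. Any raise here
    # aborts the run before a single file has been written.
    staged =
      Enum.map(edits, fn {path, fun} ->
        old = File.read!(path)
        new = fun.(old)
        if new == old, do: raise(ArgumentError, "edit matched nothing in #{path}")
        {path, new}
      end)

    # Phase 2: write. A disk failure here can still leave partial state.
    Enum.each(staged, fn {path, new} -> File.write!(path, new) end)
    Enum.map(staged, fn {path, _new} -> path end)
  end
end

# Demo in a temp directory so the sketch is self-contained.
dir = Path.join(System.tmp_dir!(), "staged_edit_demo")
File.mkdir_p!(dir)
a = Path.join(dir, "a_worker.ex")
b = Path.join(dir, "b_worker.ex")
File.write!(a, "get_port(:old_port)\n")
File.write!(b, "get_port(:something_else)\n")

rename = &String.replace(&1, "get_port(:old_port)", "get_port(:new_port)")

# b contains no match, so the whole run aborts and a stays untouched.
result =
  try do
    StagedEdit.run([{a, rename}, {b, rename}])
  rescue
    e in ArgumentError -> {:aborted, Exception.message(e)}
  end
```

The design choice is to treat "the edit changed nothing" as an error, so a silent no-op replace shows up in the context instead of passing unnoticed.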

Why This Breaks the Loop

By using the raw programming language as the editing engine, we unlock a few massive architectural advantages:

Zero Context Re-Reads
The model does not need to execute an edit, wait for the harness to update, read the file again, and format a second diff. The state lives inside the Elixir evaluation stream. It can pipe the output of one file read directly into the modification of another, completing complex refactors in a single API call.
Localized Logic, Fewer Hallucinations
The model doesn't need to guess line numbers or spaces. It writes native Elixir filters, regex matches, or map functions to locate exactly what it wants to change. The search logic is executed live by the BEAM runtime, completely removing the brittle dependency on matching precise whitespace layouts.
Native Multitasking
Because eeva executes inside a supervisor tree, the entire multi-edit script runs inside an isolated, transient BEAM process. If the model makes a logic error mid-stream, the process crashes safely, and the runtime drops the exact failure back into the context.
Dynamic Introspection
In a standard text-based setup, the model edits a file, hopes for the best, and then has to call a separate tool to run tests or verify syntax. If the edit broke a dependency three folders over, the agent is blind to it until the entire run crashes.

With pure Elixir execution, the model can introspect its edits inline before finishing the operation. It can pipe its file mutations directly into the live compiler to verify the changes don't break the system:

elixir
# Edit the file, then verify the module still compiles in the same pass
File.read!("lib/math.ex")
|> String.replace("def add(a, b), do: a + b", "def add(a, b), do: b + a")
|> then(&File.write!("lib/math.ex", &1))

# Live validation check before the token stream concludes
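# Code.compile_file/1 returns a list of {module, bytecode} tuples
# and raises on failure, so we rescue instead of matching on {:ok, _}
try do
  modules = Code.compile_file("lib/math.ex")
  "Success: #{inspect(Enum.map(modules, &elem(&1, 0)))}"
rescue
  error -> "Failed compilation inline: #{Exception.message(error)}"
end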
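
The process isolation described under "Native Multitasking" can be sketched with plain BEAM primitives. This is an assumption about the general shape, not eeva's actual implementation: `IsolatedEval` is a hypothetical module, and a real harness would more likely run this under a supervisor than use a bare `spawn_monitor`.

```elixir
defmodule IsolatedEval do
  # Hypothetical helper for illustration; this is not eeva's actual API.
  def run(code, timeout \\ 5_000) do
    parent = self()
    tag = make_ref()

    # spawn_monitor starts an unlinked process: if it crashes, the caller
    # only receives a :DOWN message instead of crashing too.
    {pid, ref} =
      spawn_monitor(fn ->
        {result, _binding} = Code.eval_string(code)
        send(parent, {tag, result})
      end)

    receive do
      {^tag, result} ->
        Process.demonitor(ref, [:flush])
        {:ok, result}

      # An uncaught exception exits with {error, stacktrace}.
      {:DOWN, ^ref, :process, ^pid, {error, stacktrace}} when is_list(stacktrace) ->
        {:error, Exception.format(:error, error, stacktrace)}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:error, "process exited: " <> inspect(reason)}
    after
      timeout ->
        Process.exit(pid, :kill)
        Process.demonitor(ref, [:flush])
        {:error, "timed out after #{timeout} ms"}
    end
  end
end

{:ok, 3} = IsolatedEval.run("1 + 2")

# The evaluated code crashes, the caller survives and gets the trace as text.
{:error, trace} = IsolatedEval.run(~s(raise "boom"))
```

The formatted trace is the string a harness could hand back to the model as the observation for that turn.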

Try it out: https://github.com/beamcore/agent

Coding Agents Suck at Tools

Rocking Eval — Sun, 21 Jun 2026 08:18:29 +0000

Open up the source code of any agent framework or harness: Hermes, Copilot, Pi, Opencode, whatever.
You will find tools, tools everywhere.
Examples:
https://github.com/NousResearch/hermes-agent/tree/main/tools
https://github.com/anomalyco/opencode/tree/dev/packages/opencode/src/tool

These are bash hooks, file viewers, file editors, greps, questions, tasks, web search.

Your harness is a loop, and on each iteration all these tools and their individual instructions are injected into the context. That is why the initial request in opencode is 10k tokens for no reason.

Do we have problems with tools? Yes, a lot.
Some tools just don’t work properly because they have bugs. Some don't count lines properly, some truncate files to save a few tokens. Some invalidate the cache, making your token/$ ratio worse.

Tool signatures differ from one harness to the next, and models absolutely suck at this. The longer the session goes, the more the model “forgets” these noisy tool descriptions and shifts its attention away from them. The harness starts failing on basic operations like finding files. Weaker models drop into endless loops, eating your budget with no output at all.

When you force a model to context-switch between writing clean JavaScript and formatting a deviant, rigid JSON payload just to view a file, it all breaks down. And when it doesn't break, it's just not efficient.

Models are not trained on every harness available; they are trained on bash and coding. That’s what they need to do, and that’s why Pi is so good. But it could be better!

The failure peak is always file editing. Writing a file from scratch is a forward-flowing token stream, which we guess is quite easy. Editing an existing file requires the model to hold an exact mental map of the code's Abstract Syntax Tree (AST), match whitespace indentation perfectly (tabs vs. spaces), and calculate precise line diffs.
When weaker models attempt a search-and-replace edit, they almost always miss a newline or a trailing brace. The harness rejects the edit. The model gets confused by the raw bash or parser error, loses its place in the file, and begins modifying the wrong lines entirely, corrupting the codebase until the context window is nothing but garbage. Tools produce garbage and pollute your context. The more tools a harness has, the more unrelated garbage ends up in your context.

We think we solved it. What if the harness just let the model write code to edit other code? It’s already trained on doing that, kind of. So we decided to build a harness with only 1 “tool”, which is Elixir Eval. We call it eeva.

Elixir is a perfect language for models. It looks similar to bash, has the same “piping” behavior as bash, and ships with very similar built-in functions like File.ls or File.read. It is also a clean functional language, and models are very good at this. They are good with both bash and Elixir, and they combine that knowledge and attention to solve tasks. Every time the model fails an operation, the harness feeds it Elixir error traces that always look alike.
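
As a rough illustration of the "one tool, one error format" idea, here is a minimal in-process eval wrapper. `EvalTool` is a hypothetical name and this is only a sketch under that assumption; the actual eeva code may look quite different.

```elixir
defmodule EvalTool do
  # Hypothetical single "tool": evaluate a string of Elixir and return either
  # the inspected result or an error banner in the standard Elixir format.
  def call(code) do
    {result, _binding} = Code.eval_string(code)
    {:ok, inspect(result)}
  rescue
    error -> {:error, Exception.format_banner(:error, error)}
  catch
    kind, reason -> {:error, Exception.format_banner(kind, reason)}
  end
end

# Listing, filtering, and piping read much like a shell pipeline.
{:ok, _listing} = EvalTool.call(~s[File.ls!(".") |> Enum.filter(&String.ends_with?(&1, ".ex"))])

# Every failure comes back in the same "** (Module) message" shape.
{:error, "** (RuntimeError) boom"} = EvalTool.call(~s(raise "boom"))
{:error, missing} = EvalTool.call(~s[File.read!("/no/such/file")])
```

Whatever goes wrong, whether a missing file, a bad match, or a syntax error, the model sees one consistent error format rather than a different one per tool.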

This seems like a small change, but it really flips the game a bit. The longer your agent works and the longer the context grows, the more precise its edits and operations become. Instead of filling the context with junk errors from random tools, we fill it with Elixir compile errors, forcing the model into Elixir and basically “fine-tuning” it on the fly with high-quality outputs and results.

With a big enough context, all coding agents eventually fail, even ours. But the ceiling is much higher this time.

So if you wanna try this approach, take a shot: https://github.com/beamcore/agent