Francesco Marconi
Building an AI-Powered Code Editor (Part 2): The LLM as an Interpreter

It’s Not a Prompt. It’s a Procedural DSL in Natural Language

While building LLM CodeForge, an agentic editor that lets an LLM read, modify, and test code autonomously, I realized something after writing roughly 5,000 tokens of instructions:

I wasn't writing a prompt. I was building a Domain-Specific Language embedded in natural language.

This article analyzes how and why this distinction is fundamental—and what you can learn for your own agentic systems.

The LLM Doesn't Decide, It Executes

Initially, I thought I was "instructing" an LLM on how to behave. But watching the system work, I realized I was doing something different:

I was forcing the model to impersonate an interpreter.

The model in CodeForge:

  • Does not decide what to do
  • Decides only which branch of the protocol to follow
  • Does not solve problems creatively
  • Executes a procedure described in natural language

This is very close to:

  • A bytecode interpreter
  • A text-driven Finite State Machine
  • A planner with closed actions

And it works because I accepted that the LLM is fundamentally unreliable.

Anatomy of an Embedded DSL

1. Control Flow: Decision Protocol

Here is what the DSL's "control flow" looks like:

Every request follows 4 steps: 
[UNDERSTAND] → [GATHER] → [EXECUTE] → [RESPOND]

### Step 1: UNDERSTAND
Classify request type:

| Type         | Keywords              | Next Step           |
|--------------|-----------------------|---------------------|
| Explanation  | "what is", "explain"  | [RESPOND] text      |
| Modification | "add", "change"       | [GATHER] → [EXECUTE]|
| Analysis     | "analyze", "show"     | [GATHER] → [RESPOND]|

This is not Chain-of-Thought in the classic sense. It is deterministic task routing—a decision table mapping input → workflow.

The model doesn't "think", it executes a conditional jump.
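
As a rough illustration, here is what that routing looks like if you pull it out of the prompt and into code. This is a hedged sketch: the keyword lists and step names are simplified from the table above, not the exact ones CodeForge uses.

// A sketch of the decision table as plain code (illustrative keywords only)
const ROUTES = [
  { keywords: ['what is', 'explain'], steps: ['RESPOND'] },
  { keywords: ['add', 'change'],      steps: ['GATHER', 'EXECUTE'] },
  { keywords: ['analyze', 'show'],    steps: ['GATHER', 'RESPOND'] },
];

function routeRequest(userMessage) {
  const text = userMessage.toLowerCase();
  const match = ROUTES.find(r => r.keywords.some(k => text.includes(k)));
  // Fall back to a plain response when nothing matches
  return match ? match.steps : ['RESPOND'];
}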

2. Invariants: Read-Before-Write Policy

The DSL defines invariants that must be maintained:

🚨 CRITICAL RULE: You CANNOT use `update_file` 
on a file you haven't read in this conversation.

Self-check before ANY update_file:
- [ ] Did I receive the content from system?
- [ ] Do I know exact current state?
- [ ] Am I modifying based on actual code?

If ANY answer is NO → OUTPUT read_file ACTION, STOP.

This is an attempt to define pre-conditions in natural language. It’s like writing:

def update_file(path, content):
    assert path in conversation_state.read_files
    # ... actual update

But without a type system or automatic runtime enforcement.

Important note: this rule greatly reduces the chance of the LLM modifying a file it has not read, but it does not guarantee 100% enforcement. It is a constraint expressed in natural language, and therefore subject to the model's probabilistic interpretation. In tests I observed high stability (~85-90%), but server-side validation remains essential for critical cases.
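
For those critical cases, the same invariant can be enforced outside the model. Here is a minimal sketch of what a client-side guard might look like; the function and state names are hypothetical, not CodeForge's actual API:

// Hypothetical runtime guard: reject update_file for files never read in this conversation
const conversationState = { readFiles: new Set() };

function onReadFile(path) {
  conversationState.readFiles.add(path);
}

function validateUpdateFile(action) {
  if (!conversationState.readFiles.has(action.file.path)) {
    // Instead of applying the edit, send the model back to read_file
    return { ok: false, forcedAction: { action: 'read_file', path: action.file.path } };
  }
  return { ok: true };
}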

3. State Management: Injection + Hard Next Action

The most effective technique I implemented is dynamically regenerating the prompt to force the LLM to follow a multi-step plan.

Concrete scenario: The user asks "Add authentication to the project".

Step 1: The LLM analyzes and generates a plan:

{
  "plan": "I will modify these files in order:",
  "files_to_modify": ["Auth.js", "Login.jsx", "App.jsx"]
}

Step 2: The LLM starts with the first file (Auth.js) and completes it.

Step 3 - HERE IS THE TRICK: Instead of asking the LLM "remember the plan, do the next file", I completely regenerate the prompt, adding this section:

### ⚠️ MULTI-FILE TASK IN PROGRESS

You completed: Auth.js
Remaining files: Login.jsx, App.jsx

### 🚨 REQUIRED ACTION
Your next output MUST be:
{"action":"continue_multi_file","next_file":{"path":"Login.jsx"}}

Do NOT do anything else. Do NOT deviate from the plan.

Result: The LLM doesn't have to "remember" anything. It cannot "forget" the plan. It cannot decide to do something else. The prompt itself contains the only possible action.

Why it is powerful:

  • The state (which files I've done, which are missing) lives in the JavaScript code, not in the LLM's "memory"
  • At every step, I regenerate the prompt with the updated state
  • The LLM always sees "you are here, do this" — zero ambiguity

In practice:

// In the code
function buildPrompt(multiFileState) {
  let prompt = BASE_PROMPT;

  if (multiFileState) {
    prompt += `
    ### TASK IN PROGRESS
    Completed: ${multiFileState.completed.join(', ')}
    Next: ${multiFileState.next}
    Your ONLY valid action: continue_multi_file with ${multiFileState.next}
    `;
  }

  return prompt;
}

This is state injection: external state (JavaScript) completely controls what the LLM can do at the next step.
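
The missing piece is how that external state advances. Here is a sketch of the driving loop, under the assumption that each model turn completes exactly one file; the names are illustrative:

// Illustrative driver: the JavaScript side owns the plan and advances it
function advanceMultiFileState(state, completedFile) {
  const completed = [...state.completed, completedFile];
  const remaining = state.remaining.filter(f => f !== completedFile);
  if (remaining.length === 0) return null; // task finished: buildPrompt drops the injected section
  return { completed, remaining, next: remaining[0] };
}

// Usage: regenerate the prompt from the new state before every call
// state = advanceMultiFileState(state, 'Auth.js');
// const prompt = buildPrompt(state);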

4. Simulated Type System: Structured Output

To handle different "types" (JSON + code + metadata), the DSL uses custom delimiters:

#[json-data]
{"action":"create_file","file":{"path":"App.jsx"}}
#[end-json-data]
#[file-message]
This file implements the main app component.
#[end-file-message]
#[content-file]
export default function App() {
  return <div>Hello World</div>;
}
#[end-content-file]

Why not standard JSON or XML? Because the content might contain {} or <> — it would require complex escaping.

The delimiters #[tag]...#[end-tag] are:

  • Syntactically unique (no conflicts with internal code)
  • Easy to parse (regex + split)
  • Independent of the embedded language

It is like defining a context-free grammar to separate semantic levels.
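
A rough sketch of that parsing step, in simplified form (the real tag set and error handling are richer):

// Minimal parser for #[tag]...#[end-tag] blocks
function parseBlocks(output) {
  const blocks = {};
  const re = /#\[([a-z-]+)\]([\s\S]*?)#\[end-\1\]/g;
  let match;
  while ((match = re.exec(output)) !== null) {
    blocks[match[1]] = match[2].trim();
  }
  return blocks; // e.g. { 'json-data': '...', 'file-message': '...', 'content-file': '...' }
}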

5. Error Handling: Anti-Pattern Documentation

The DSL includes "error examples" to guide the parser (the model):

**Common Errors:**

❌ {"action":"start_multi_file","plan":{},"first_file":{...}}
✅ {"action":"start_multi_file","plan":{},"first_file":{...}}}

❌ #[json-data]{...}#[file-message]...
✅ #[json-data]{...}#[end-json-data]#[file-message]...

This is inline error-correction training — I am teaching the model the common failure modes. It’s like unit tests embedded in the language documentation.

The Price: Structural Tensions

Building a DSL in natural language isn't free. Here are the trade-offs I accepted:

Tension #1: No Real Type System

I am creating a procedural DSL, but:

  • ❌ It has no verifiable types
  • ❌ It has no automatic syntactic validation
  • ❌ It has no AST for transformations

To compensate, the prompt contains:

  • ✅ Huge validation checklist (8+ points)
  • ✅ Semantic redundancy (same rules, 3+ formulations)
  • ✅ Extensive anti-pattern documentation

This is inevitable when the parser is a probabilistic LLM instead of a deterministic compiler.

If I evolve CodeForge in the future, a true mini-DSL (JSON Schema + codegen) would reduce the prompt by 30-40%. But for a browser-sandboxed project, the natural-language approach is justified.
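
For reference, that "mini-DSL" direction would mean one schema per action, validated before execution. A sketch for the create_file action, with field names taken from the examples above rather than a full spec:

{
  "$id": "create_file.action",
  "type": "object",
  "required": ["action", "file"],
  "properties": {
    "action": { "const": "create_file" },
    "file": {
      "type": "object",
      "required": ["path"],
      "properties": { "path": { "type": "string" } }
    }
  }
}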

Tension #2: Meta-Validation Only Works as a Multiplier

The pre-send checklist I implemented:

Before EVERY response, verify:
| # | Check | Fix If Failed |
|---|-------|---------------|
| 1 | JSON valid | Correct structure |
| 2 | Tags complete | Add missing #[end-*] |

Alone, it would have 40-60% reliability. In my system, it probably has 80-90%.

Why? Because it works as a stability multiplier when:

  • The model is already channeled (decision protocol)
  • The format is rigid (custom delimiters)
  • The next action is deterministic (state injection)

Meta-validation is not the main feature — it is the final safety net in an already constrained system.
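
That safety net can also be mirrored in code, so that a malformed response triggers a retry instead of a broken action. A minimal sketch, assuming the parseBlocks helper sketched earlier; the retry policy is up to the caller:

// Client-side mirror of the pre-send checklist: validate before acting
function validateResponse(rawOutput) {
  const blocks = parseBlocks(rawOutput);
  if (!blocks['json-data']) return { ok: false, reason: 'Missing #[json-data] block' };
  try {
    const action = JSON.parse(blocks['json-data']);
    if (!action.action) return { ok: false, reason: 'JSON has no "action" field' };
    return { ok: true, action, blocks };
  } catch (e) {
    return { ok: false, reason: `Invalid JSON: ${e.message}` };
  }
}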

Tension #3: High Cognitive Cost

5000 dense tokens with intersecting rules means:

  • ✅ Works well with Claude 3.5, GPT-4
  • ❌ Smaller models will fail
  • ❌ Less aligned models will ignore sections

I am implicitly saying: this system requires "serious" models.

It is an architectural constraint I accepted — like saying "this library requires Python 3.10+".

Techniques That Work: Contextual Re-Anchoring

One thing I did (almost by instinct) is contextual re-anchoring.

Take the "read-before-write" rule:

  • It appears in the Decision Protocol (when planning)
  • It appears in Available Actions (when executing)
  • It appears in Pre-Send Validation (when verifying)
  • It appears in Golden Rules (as a general principle)

This is not random redundancy. It is strategic repetition in different contexts.

It is exactly how it is done in safety-critical systems:

  • Same invariant
  • Verified at multiple levels
  • With specific phrasing for the context

Replicable Patterns: How to Build an Agentic DSL

If you want to build a similar system, here are the patterns I extracted:

Pattern #1: External State > Internal Memory

// BAD: Relying on the model's "memory"
"Remember that you have already read these files..."

// GOOD: Injecting explicit state
prompt += `Files already read: ${readFiles.join(', ')}`

Pattern #2: Reduce Branching When Possible

// BAD: Giving open choices
"Decide which operation to perform"

// GOOD: Forcing the only legal move
"Your NEXT action MUST be: continue_multi_file"

Pattern #3: Decision Tables for Routing

| Input Pattern | Action | Next State |
|---------------|--------|------------|
| "add X"       | GATHER | EXECUTE    |
| "explain Y"   | RESPOND| END        |

Instead of "think what to do", use "if X then Y".

Pattern #4: Custom Delimiters for Content Nesting

When you have to embed arbitrary content:

  • Don't use JSON (escaping nightmare)
  • Don't use XML (conflicts with HTML/JSX)
  • Use unique tags: #[content]...#[end-content]

Pattern #5: Redundancy = Coverage, Not Noise

Repeat critical rules:

  • In different formulations (semantic reinforcement)
  • In different contexts (contextual re-anchoring)
  • With different rationales (why, not just what)

What I Learned: Engineering > Elegance

After 5000 tokens and months of iterations, the most important lesson:

This prompt is not "beautiful". It is effective.

I stopped looking for:

  • ✗ The shortest possible prompt
  • ✗ The most elegant formulation
  • ✗ The most general abstraction

I started optimizing for:

  • ✓ Robustness in edge cases
  • ✓ Failure mode coverage
  • ✓ Debugging clarity when it fails

The result:

  • Redundant? Yes.
  • Verbose? Absolutely.
  • Works? Consistently.

Future Directions: Where I Could Go

If I were to evolve CodeForge 2.0, I would explore:

Two-Agent Architecture

Instead of a single-agent with 5000 tokens:

  • Planner Agent (2000 tokens): Decides strategy
  • Executor Agent (2000 tokens): Implements actions

Benefits:

  • Separation of concerns
  • Less context per agent
  • Parallel execution possible
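
If I ever go that way, the split could look roughly like this. Everything here is hypothetical: PLANNER_PROMPT, EXECUTOR_PROMPT, callLLM, and applyEdit are placeholders, not existing CodeForge code:

// Hypothetical planner/executor split (sketch only)
async function runTask(userRequest, callLLM) {
  // Planner: small prompt, outputs only a plan, no code
  const plan = await callLLM(PLANNER_PROMPT, userRequest); // e.g. { files: ['Auth.js', 'Login.jsx'] }

  // Executor: small prompt, sees one file at a time
  for (const file of plan.files) {
    const edit = await callLLM(EXECUTOR_PROMPT, { file, request: userRequest });
    applyEdit(edit); // hypothetical helper that writes the change
  }
}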

Conclusions: State-of-the-Art for Coding Agents

Developing LLM CodeForge taught me that building reliable agentic systems means:

Accepting that LLMs are fundamentally unreliable, and designing around that fact.

The techniques that work:

  1. ⭐⭐⭐⭐⭐ State injection + forced next action
  2. ⭐⭐⭐⭐⭐ Decision tables for task routing
  3. ⭐⭐⭐⭐ Custom delimiters for structured output
  4. ⭐⭐⭐⭐ Contextual re-anchoring of invariants
  5. ⭐⭐⭐ Meta-validation as a safety net
  6. ⭐⭐ Visual hierarchy (useful but not critical)

The fundamental principle:

Don't ask the LLM to "understand" — force it to "execute".
Don't do prompt engineering — do protocol design in natural language.

When you build your next agentic system:

  • Treat it like a DSL, not a conversation
  • External state must constrain possible actions
  • Validate server or client side, always, without exceptions
  • Redundancy can be a feature, not a bug

LLMs are powerful tools, but they are probabilistic parsers, not deterministic compilers. Design accordingly.


Try CodeForge: https://llm-codeforge.netlify.app/

The project is open source — if you want to see the full prompt and the validation system implementation, you can find everything in the repository.

Questions for the community:

  • Have you ever built embedded DSLs in natural language?
  • What is the cost of "cognitive overhead" in your prompts?
  • Two-agent architecture vs single-agent: experiences?

Share in the comments — this is still largely unexplored territory.
