Truong Phung

Posted on Apr 28

🤖 SWE-agent — Deep Dive & Build-Your-Own Guide 📘

#ai #llm #webdev #tutorial

A practical, code-level walkthrough of how SWE-agent (Princeton NLP / Stanford, NeurIPS 2024) actually works, why its Agent-Computer Interface (ACI) is the single most-imitated idea in coding agents today, and how you can build a similar autonomous coding agent from scratch.

Written April 2026. Based on the SWE-agent paper (arXiv 2405.15793), the EnIGMA paper (arXiv 2409.16165), the docs site, the SWE-ReX runtime, and the source of SWE-agent/SWE-agent.

💡 TL;DR — what SWE-agent is, in one paragraph
1. 🧠 The mental model — five moving parts
2. ⚙️ The agent loop — the canonical 30 lines
3. 🖱️ The Agent-Computer Interface — the central thesis
4. 📦 Tool bundles — YAML manifests + bash scripts
5. 📝 Prompts — what the agent actually sees
6. 🐳 The runtime — SWE-ReX
7. 🚀 Autonomy — what makes the loop run unattended
8. ⚙️ The configuration system
9. 🔐 EnIGMA — the multi-domain extension
10. 🏗️ Build-your-own — the 200-line minimum viable agent
11. 📊 Numbers — what SWE-agent actually scores
12. ⚠️ Lessons & limitations the team has stated
13. 🎯 The TL;DR — design rules to copy
📚 Sources

💡 TL;DR — what SWE-agent is, in one paragraph

SWE-agent is the open-source autonomous SWE agent that proved a then-controversial thesis: language models are a new kind of end user, and they need software interfaces designed for them, not for humans. The original v0.7 hit 12.5% on SWE-Bench (Full) with GPT-4 alone — roughly 3× the previous best — and v1.0+ reached SoTA on SWE-Bench Verified with Claude 3.7 Sonnet. Architecturally it's a tiny core: a stateless-ish DefaultAgent that runs a while not done loop, an SWEEnv that talks to a sandboxed shell via the SWE-ReX runtime, and a deck of YAML-defined "tool bundles" that get uploaded into the sandbox as plain bash/Python scripts. The agent's job is to emit one bash command per turn; the sandbox executes it; the resulting stdout becomes the next observation. Everything else — the windowed file viewer, the lint-on-edit guard, the search-with-truncation, the autosubmit-on-error — is just bash scripts that the LLM happens to call.

If you only remember three slogans:

🖥️ The interface is the model. A 100-line windowed viewer + a line-targeted edit + a syntax-checked autosave roughly doubles SWE-Bench score versus raw bash. The agent isn't smarter; it's better-housed.
📜 Tools are bash scripts in a YAML manifest. SWE-agent has no Python plugin system, no tool framework, no schema layer. A tool is a file in tools/<bundle>/bin/. The YAML tells the prompt how to call it. The runtime executes it as a bash command. This is the cheapest tool abstraction that works.
🔁 Autonomy = guarded loops + autosubmit. The agent runs until it submits, OR until it dies. On every fatal error — cost overrun, context overflow, timeout, syntax loop — the framework runs git diff one last time and ships whatever partial patch exists. Failure modes turn into degraded successes.

1. 🧠 The mental model — five moving parts

+-------------------+      +--------------------+      +--------------------+
|   DefaultAgent    |<-----|       SWEEnv       |<---->|   SWE-ReX runtime  |
| (loop, history,   |      | (problem, repo,    |      | (Local/Docker/Modal|
|   trajectory)     |      |  bundle install)   |      |  /Fargate session) |
+---------^---------+      +---------^----------+      +---------^----------+
          |                          |                           |
          | uses                     | exposes communicate()     | runs bash
          v                          v                           v
     +----------+              +-----------+              +---------------+
     |  Model   |              | Tool bundles|             | Persistent    |
     | (LiteLLM |              | (YAML +     |             | shell session |
     |  funcs)  |              |  bash/python)|            | + git repo    |
     +----------+              +-----------+              +---------------+

Five roles, sharp seams:

Agent (DefaultAgent in sweagent/agent/agents.py) — owns the while not done: loop. Holds three things: the chat history, a list of StepOutputs called the _trajectory, and a reference to the model + env. Emits one action per turn.
Environment (SWEEnv in sweagent/environment/swe_env.py) — wraps a SWE-ReX sandbox. Knows how to clone the target repo, install tool bundles, set env vars, and communicate(action) (send bash, receive stdout + exit code).
Tool bundles (tools/<bundle>/) — directories of YAML + bash + Python that get copied into the sandbox at startup. The YAML declares the agent-facing signature; the bin/ script is the actual implementation.
SWE-ReX (separate repo, SWE-agent/SWE-ReX) — the runtime. Provides a uniform "spawn a shell session, run a command, get output" API across Local, Docker, Modal, AWS Fargate, Daytona. Detects command completion via shell sentinels, not timeouts.
Model (LiteLLMModel in sweagent/agent/models.py) — wraps any provider via LiteLLM. Anthropic / OpenAI use native function-calling; everyone else falls back to a ThoughtActionParser that pulls the bash command out of a triple-backtick block.

Why this shape works

Property	How SWE-agent gets it
Universally-portable tools	Tools are bash scripts. They work on any Linux runtime SWE-ReX can spawn. No Python plugin loader, no SDK, no compiled artifact.
Replayable trajectories	Every `step()` writes a `.traj` JSON containing history, model output, observations, costs. `sweagent run-replay` re-executes any old run.
Swappable execution	Local for development, Docker for SWE-Bench, Modal/Fargate for parallel batch eval. Same agent code; just `--env.deployment.type=modal`.
No hidden state	The agent's "memory" is `self.history` (a flat list of messages) plus a JSON state blob from a tool's `state_command` (e.g. `{"open_file": "...", "working_dir": "..."}`) that's interpolated into every prompt. That's it.
Autonomous by default	Loop runs until `submit`, `exit`, cost-limit, timeout, or context-overflow. There is no interactive confirmation step. Errors don't crash — they trigger autosubmit.

The four design principles to steal

If you're building your own, lift these wholesale:

Don't reuse the human shell. cat, sed, grep -rn are all bad agent tools. They produce too much output, leak state, and have no error-recovery story. Build a tiny set of agent-shaped commands.
Tools must produce bounded, structured output. Every observation gets a header, a body, a footer. The agent always knows where the output starts and ends, what it just did, and where it is now.
Edits must be validated before they land. Free-form sed -i produces broken Python in subtle ways the model can't see. A line-range edit with a flake8 + auto-rollback gate eliminates an entire class of compounding error.
Stop conditions are explicit and many. Cost limit, step limit, timeout, consecutive-failure count, context overflow — wire all of them, and have every one of them produce a degraded success (autosubmit) rather than a hard failure.

2. ⚙️ The agent loop — the canonical 30 lines

Here's the actual DefaultAgent.run() loop, distilled. This is the single most important snippet in the project — read it twice.

# sweagent/agent/agents.py (paraphrased)
def run(self) -> Trajectory:
    self.setup()                          # render templates, install bundles, build first user msg
    step_output = StepOutput(done=False)
    while not step_output.done:
        step_output = self.step()
        self.save_trajectory()            # write .traj after every step
    return self._trajectory

def step(self) -> StepOutput:
    output = self.forward_with_handling(self.messages)
    self.add_step_to_history(output)
    self.add_step_to_trajectory(output)
    if output.submission:
        self.info["submission"] = output.submission
        self.info["exit_status"] = output.exit_status
    return output

def forward_with_handling(self, messages):
    for attempt in range(self.max_requeries):    # default 3
        try:
            return self.forward(messages)
        except FormatError:                      # parser couldn't extract an action
            messages = inject(self.templates.format_error_template)
        except _BlockedActionError:              # action matched a blocklist
            messages = inject(self.templates.action_blocked_template)
        except BashIncorrectSyntaxError:         # `bash -n` rejected the command
            messages = inject(self.templates.shell_check_error_template)
        except CommandTimeoutError:
            self._n_consecutive_timeouts += 1
            if self._n_consecutive_timeouts >= self.tools.config.max_consecutive_execution_timeouts:
                raise
            messages = inject(self.templates.command_cancelled_timeout_template)
    raise FormatError("Exceeded max_requeries")

def forward(self, messages) -> StepOutput:
    response   = self.model.query(messages)                 # LiteLLM call, tracks cost
    thought, action = self.tools.parse_actions(response)    # extract command
    if self.tools.should_block_action(action):
        raise _BlockedActionError()
    self._check_bash_syntax(action)                          # bash -n
    observation = self._env.communicate(action, timeout=...) # SWE-ReX round trip
    state       = self._env.communicate(self.tools.config.state_command)
    if self.tools.check_for_submission_cmd(observation):
        return self.handle_submission(observation, state)
    return StepOutput(thought=thought, action=action, observation=observation, state=state)

That's it. The whole project is a deck of templates, parsers, and bash scripts hung off this skeleton. Five phases worth memorizing:

Render the prompt by piping self.history through history processors (truncate old observations, mark cache anchors).
Call the model. Track cost; if instance_cost > per_instance_cost_limit, raise.
Parse the response into (thought, action) via the configured parser.
Validate the action (blocklist + bash -n). Bad actions go back to the model with a templated error message.
Execute in SWE-ReX, capture stdout, refresh state, check for submission sentinel.

Build-your-own: copy this 5-phase shape verbatim. The two non-obvious parts are (a) the requery loop on parse/validate failure and (b) the post-action state_command round-trip — both are what give SWE-agent its self-correcting feel.

3. 🖱️ The Agent-Computer Interface — the central thesis

The paper's argument, paraphrased: a human SWE uses VS Code, Sourcegraph, gdb. Each tool is built around human strengths (visual chunking, muscle memory, undo) and weaknesses (limited working memory, poor at tedium). An LM has the inverse profile — perfect recall of what's in context, terrible local memory of what just happened — so reusing human tools is malpractice.

The empirical claim from the paper's ablation and the docs at swe-agent.com/latest/background/aci/: the same GPT-4, with the same problem statement, scores roughly 2× on SWE-Bench when given ACI tools versus raw bash. That number is not a fluke — it's the difference between the agent appearing to work and not working.

Four ACI design rules (the operational definition)

Concise, bounded output. No cat-the-whole-file. No unbounded grep -rn. Every command's output has a maximum size and a structured shape.
Persistent state across turns. The runtime owns the cursor (CURRENT_FILE, FIRST_LINE). The agent never has to reconstruct "where am I" from history.
Guard rails on destructive actions. Edits run through a linter; failures auto-revert. Bash is -n-checked before execution.
Predictable, tiny argument grammar. Each command has 1–2 positional args, max. Multi-line bodies are bracketed by sentinels (end_of_edit) so they parse robustly.

The four flagship ACI tools

There are exactly four tool bundles that make up the original ACI: a windowed viewer, a search trio, a line-targeted edit-with-lint, and a submit. Everything else (Anthropic's str_replace_editor, the filemap viewer, multimodal helpers, EnIGMA's debugger) is later evolution.

3.1 👁️ `tools/windowed/` — the file viewer

Defines five commands:

# tools/windowed/config.yaml (verbatim)
tools:
  open:        { signature: 'open "<path>" [<line_number>]', ... }
  goto:        { signature: "goto <line_number>", ... }
  scroll_up:   { signature: "scroll_up", ... }
  scroll_down: { signature: "scroll_down", ... }
  create:      { signature: "create <filename>", ... }
state_command: "_state"

Behavior, exact:

Window size: 100 lines, configurable via WINDOW= env var.
Overlap: 2 lines. scroll_down advances first_line by WINDOW − OVERLAP = 98, so the bottom 2 lines of the previous view become the top 2 lines of the new view. The agent never loses local context across pagination.
goto N doesn't put N at the top. It puts N about 1/6 of the way down the window, so the agent gets ~17 lines of preceding context "for free". This compromise is in tools/windowed/lib/windowed_file.py as offset_multiplier = 1/6.
Line numbers are always shown. Format: 123: return foo(). No padding. No separator beyond the colon.
The window is bracketed by status lines:

  [File: /repo/src/foo.py (250 lines total)]
  (95 more lines above)
  96:def example():
  97:    return 1
  ...
  195:    pass
  (55 more lines below)

The first status line tells the agent the file's total length so it can plan; the "more lines above/below" markers give explicit relative position so the agent knows when to stop scrolling.

_state is the side-channel command run after every action; it returns JSON like {"open_file": "/repo/src/foo.py", "working_dir": "/repo"} which gets interpolated into the next prompt. The agent always knows where it is.

Why not cat? A 5000-line cat consumes more tokens than the rest of the trajectory combined, scrolls the actual point of interest off-screen, and provides no signal about which line is which. The viewer hands the agent a small bounded slice with explicit line numbers it can use as edit targets.

3.2 🔍 `tools/search/` — bounded search

Three commands: find_file, search_dir, search_file. The interesting one is search_dir, which is intentionally not grep -rn:

# tools/search/bin/search_dir (essential)
matches=$(find "$dir" -type f ! -path '*/.*' \
          -exec grep -nIH -- "$search_term" {} + | cut -d: -f1 | sort | uniq -c)
num_files=$(echo "$matches" | wc -l)
if [ $num_files -gt 100 ]; then
    echo "More than $num_files files matched ... Please narrow your search."
    return
fi
echo "Found $num_matches matches for \"$search_term\" in $dir:"
echo "$matches" | awk '{print $2 " ("$1" matches)"}'
echo "End of matches for \"$search_term\" in $dir"

Output:

Found 47 matches for "parse_args" in /repo:
./src/cli.py (12 matches)
./src/parser.py (28 matches)
./tests/test_cli.py (7 matches)
End of matches for "parse_args" in /repo

Three ACI choices encoded here:

Only filenames + counts, not the matching lines. A repo-wide grep would dump thousands of matching lines and waste 80% of context. Counts give the agent enough information to pick one file to read in detail.
Hard cap at 100 files. If exceeded, the tool refuses and tells the agent to narrow its search rather than silently truncating. This forces the agent to think instead of brute-forcing.
Explicit start/end markers ("Found ... matches" / "End of matches"). The agent can always tell where the tool's output begins and ends, even if the model concatenates several observations.

search_file (single-file) returns matching lines but caps at 100 and tells the agent to narrow. find_file is plain find -name with shell globs.

3.3 ✏️ `tools/windowed_edit_linting/` — the edit-with-rollback

This is the most carefully engineered piece of the ACI. The signature:

edit <start_line>:<end_line>
<replacement_text>
end_of_edit

The end_name: "end_of_edit" field in the YAML tells the parser that everything between edit N:M and end_of_edit is one multi-line argument. This is robust: triple-backtick delimiters get confused inside Python code; a unique sentinel string never collides.

The implementation in tools/windowed_edit_linting/bin/edit:

pre_edit_lint = flake8(wf.path)
wf.set_window_text(replacement_text, line_range=(start_line, end_line))
post_edit_lint = flake8(wf.path)

new_errors = format_flake8_output(
    post_edit_lint,
    previous_errors_string=pre_edit_lint,
    replacement_window=(start_line, end_line),
    replacement_n_lines=len(replacement_text.splitlines()),
)
if new_errors:
    with_edits    = wf.get_window_text(line_numbers=True, status_line=True, pre_post_line=True)
    wf.undo_edit()                # REVERT
    without_edits = wf.get_window_text(...)
    print(_LINT_ERROR_TEMPLATE.format(errors=new_errors,
                                      window_applied=with_edits,
                                      window_original=without_edits))
    exit(1)

The linter is flake8 with a deliberately tight ruleset:

flake8 --isolated --select=F821,F822,F831,E111,E112,E113,E999,E902 <file>

F821/F822 — undefined names
F831 — duplicate args
E111/E112/E113 — indentation errors
E999 — syntax error
E902 — IO/tokenization error

This is not general-purpose linting (no E1xx style, no W warnings). It's a hand-picked "did your edit produce broken Python" checklist.

The diff filter is the cleverest detail. Pre-existing errors are shifted by (replacement_n_lines − (end_line − start_line + 1)) so the agent isn't blamed for issues that were already there. Errors that fall inside the replaced window are the agent's responsibility now. Only newly-introduced errors trigger the rollback.

On failure, the agent sees:

Your proposed edit has introduced new syntax error(s). Please read this error
message carefully and then retry editing the file.

ERRORS:
- undefined name 'foo'

This is how your edit would have looked if applied
------------------------------------------------
[File: ... 130 lines total]
(95 more lines above)
96:def bar():
97:    return foo()   <-- the broken state
...
------------------------------------------------

This is the original code before your edit
------------------------------------------------
[same window, original state]
------------------------------------------------

Your changes have NOT been applied. Please fix your edit command and try again.
DO NOT re-run the same failed edit command. Running it again will lead to the same error.

Three details that matter:

The bad state is shown side-by-side with the kept state, both with line numbers, so the agent can mentally diff without having to ask.
The change was rolled back automatically. The model can't accumulate broken code over multiple turns.
The "DO NOT re-run" sentence is empirically motivated. Without it, agents often loop on identical failed edits. Prompt engineering plugs the gap that no detector fully closes.

3.4 ✅ `tools/submit/` — the done signal

The smallest tool, but it's the one that makes the loop autonomous:

# tools/submit/bin/submit
main() {
    cd $ROOT
    if [ -s "/root/test.patch" ]; then
        git apply -R < "/root/test.patch"   # reverse-apply gold tests
    fi
    git add -A
    git diff --cached > /root/model.patch
    echo "<<SWE_AGENT_SUBMISSION>>"
    cat /root/model.patch
    echo "<<SWE_AGENT_SUBMISSION>>"
}

The agent loop greps for that sentinel in the observation: tools.check_for_submission_cmd(output) → True. That sets step.done = True and the run terminates. submit does not run tests — that's deliberate. SWE-Bench evaluates externally; making submit run tests would let the agent peek at the gold-truth signal.

The sentinel pattern is worth borrowing: a unique string (<<SWE_AGENT_SUBMISSION>>, never a normal substring) means you don't need a separate "done" channel. Any tool can declare completion just by emitting it. EnIGMA uses the same pattern with ###SWE-AGENT-EXIT-FORFEIT###.

4. 📦 Tool bundles — YAML manifests + bash scripts

This is the cheapest tool abstraction that works, and it's worth understanding why.

4.1 The bundle layout

Every directory under tools/ has the same shape:

tools/<bundle>/
  config.yaml          # tool declarations: signature, docstring, args, end_name, state_command
  install.sh           # runs once at sandbox startup (pip install flake8, etc.)
  bin/
    <tool_name>        # one executable per command — bash or python+shebang
  lib/                 # shared helpers (windowed_file.py, flake8_utils.py)
  pyproject.toml       # optional, only if the bundle has Python deps

A Bundle object (sweagent/tools/bundle.py) loads the directory, parses config.yaml, and at sandbox startup uploads the entire bundle, adds bin/ to $PATH, and sources install.sh. After that, the agent's actions are just bash — when it types edit 10:15\nfoo\nend_of_edit, the runtime invokes tools/windowed_edit_linting/bin/edit because that's first on PATH.

4.2 The bundles that ship

Bundle	What it provides
`windowed`	open, goto, scroll_up, scroll_down, create, _state
`windowed_edit_linting`	line-range edit with flake8 + autorollback
`windowed_edit_replace` / `windowed_edit_rewrite`	edit variants (replace string / full file rewrite)
`edit_anthropic`	Anthropic's `str_replace_editor` (view/create/str_replace/insert/undo_edit)
`search`	find_file, search_dir, search_file
`submit`	the done sentinel
`forfeit`	clean give-up (`exit_forfeit`)
`filemap`	condensed Python file viewer (skips long bodies, shows just signatures)
`diff_state`	exposes git diff to the agent
`review_on_submit_m`	injects a reviewer LLM critique on submit
`image_tools`	multimodal helpers
`web_browser`	browser primitives (used by EnIGMA / multimodal)
`multilingual_setup`	cross-language test harness setup
`registry`	shared state registry — must be first in `bundles:` list

A config picks bundles à la carte:

# config/anthropic_filemap.yaml — the v1.0+ default
agent:
  tools:
    bundles:
      - path: tools/registry            # always first
      - path: tools/edit_anthropic
      - path: tools/review_on_submit_m
    registry_variables:
      USE_FILEMAP: 'true'
    enable_bash_tool: true               # raw bash always available
    parse_function:
      type: function_calling

The classic v0.7 paper config uses windowed + windowed_edit_linting + search + submit instead — and that's the version closest to the ACI thesis.

4.3 Why YAML + bash, not Python plugins

Three reasons:

Universal portability. A bash script runs in any Linux container. A Python plugin requires the runtime to import SWE-agent's package, which couples the sandbox to your code.
Trivial install path. cp -r tools/<bundle> $SANDBOX/root/tools/ && bash install.sh. There's no version-skew problem, no plugin loader, no FFI boundary.
Tool docs come for free. The YAML signatures + docstrings render directly into the system prompt ({{command_docs}}) or into the function-calling JSON-Schema. No separate "tool description" file to keep in sync.

The trade-off: tools can't share state through Python objects. They share state through (a) env vars in the runtime, (b) files on disk in the runtime, and (c) the _state command's JSON output. This is fine — there's basically no cross-tool coupling needed in practice.

4.4 Build-your-own bundle — the minimal spec

# tools/my_bundle/config.yaml
tools:
  my_command:
    signature: "my_command <arg1> [<arg2>]"
    docstring: "What this tool does, written for the LM."
    arguments:
      - { name: arg1, type: string,  required: true,  description: "..." }
      - { name: arg2, type: integer, required: false, description: "..." }
state_command: "_my_state"   # optional; runs after every action; emits JSON
env_variables:               # optional; bundle-scoped env injected at install
  MY_VAR: "default"

# tools/my_bundle/bin/my_command
#!/usr/bin/env bash
arg1="$1"
arg2="${2:-}"
# do the thing, echo bounded output

# tools/my_bundle/install.sh
pip install some-dep

That's the whole protocol.

5. 📝 Prompts — what the agent actually sees

SWE-agent has two prompt regimes that ship in the repo. Knowing both is useful: the modern function-calling one is what runs in production; the classic ReAct one is closer to the paper and easier to port to weaker models.

5.1 The modern default (`config/default.yaml`)

agent:
  templates:
    system_template: |-
      You are a helpful assistant that can interact with a computer to solve tasks.

    instance_template: |-
      <uploaded_files>
      {{working_dir}}
      </uploaded_files>
      I've uploaded a python code repository in the directory {{working_dir}}.
      Consider the following PR description:

      <pr_description>
      {{problem_statement}}
      </pr_description>

      Can you help me implement the necessary changes ... ?
      I've already taken care of all changes to any of the test files described in the
      <pr_description>. ...

      Follow these steps to resolve the issue:
      1. As a first step, find and read code relevant to the <pr_description>
      2. Create a script to reproduce the error and execute it with `python <filename.py>`
      3. Edit the source code of the repo to resolve the issue
      4. Re-run your reproduce script and confirm that the error is fixed!
      5. Think about edge cases and make sure your fix handles them as well

      Your thinking should be thorough and so it's fine if it's very long.

    next_step_template: |-
      OBSERVATION:
      {{observation}}
    next_step_no_output_template: |-
      Your command ran successfully and did not produce any output.

Three things to notice:

The system prompt is one line. All task context lives in the instance template. This is on purpose — system prompts are cached separately and shouldn't carry per-task content.
The five-step recipe is the prompt encoding the human SWE workflow. "Read, repro, fix, re-run, edge cases." This is the single most-copied agent prompt pattern, and it works because it gives the LM procedural memory it doesn't have natively.
No demonstrations. The v1.0 default dropped few-shot examples entirely. Strong base models do better without them; weaker ones benefit. (The classic config/sweagent_0_7/ configs still load .traj demonstrations.)

5.2 The classic ReAct prompt (`config/sweagent_0_7/07_thought_action.yaml`)

SETTING: You are an autonomous programmer, and you're working directly in the
command line with a special interface.

The special interface consists of a file editor that shows you {{WINDOW}} lines
of a file at a time. In addition to typical bash commands, you can also use the
following commands to help you navigate and edit files.

COMMANDS:
{{command_docs}}

Please note that THE EDIT COMMAND REQUIRES PROPER INDENTATION. ...

RESPONSE FORMAT:
Your shell prompt is formatted as follows:
(Open file: <path>) <cwd> $

You need to format your output using two fields; discussion and command:
DISCUSSION
First I'll start by using ls to see what files are in the current directory...

ls -a


You should only include a *SINGLE* command in the command section ...

The {{command_docs}} placeholder is rendered from the YAML signatures — if you swap bundles, the prompt updates automatically. The shell prompt (Open file: <path>) <cwd> $ is fed back every step via next_step_template, giving stateless models a constant reminder of position.

The instance template includes empirically-derived tips:

"If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command."
"If a command did not work once it will not work the second time unless you modify it!"
"When editing files, it is easy to accidentally specify a wrong line number ... Always check the code after you issue an edit."

Each of these is plugging a specific failure mode the team observed in trajectories.

5.3 Action parsing — the `parse_function` dispatch

sweagent/tools/parsing.py ships seven parsers; the config picks one:

Parser	Format expected	Use case
`function_calling`	Native `tool_calls` field on the response	Anthropic / OpenAI / any LiteLLM provider with tool support — default
`ThoughtActionParser`	Free text + first `bash ...` block	Classic ReAct, weaker / older models
`XMLThoughtActionParser`	`<command>...</command>`	Models that parse XML better than markdown
`XMLFunctionCallingParser`	`<function=name><parameter=k>v</parameter></function>`	Anthropic-XML style
`EditFormat`	Replace-block diff format	Edit-only workflows
`JsonParser`	`{"thought": "...", "command": {...}}`	Pure JSON-mode models
`ActionOnlyParser`	Whole response is the action	o1-style hidden-CoT models

Selection is agent.tools.parse_function.type in the YAML. The parser also implements should_block_action (block list of disallowed commands like vim or nano that would hang the shell).

5.4 History processors — keeping context bounded

A pluggable transformation layer applied to messages before each model call. Two ship by default:

history_processors:
  - type: cache_control       # adds Anthropic prompt-cache breakpoints
    last_n_messages: 2
  - type: last_n_observations # truncates old observations
    n: 5

cache_control marks the last 2 messages with cache_control: ephemeral so Anthropic's prompt cache fires every turn — older history is stable and cacheable, only the trailing turns change.

last_n_observations is the brutal one: it drops all but the most recent N observations from the messages array, keeping the actions and thoughts in place but blanking out the stdout of older steps. For a 50-turn trajectory, this is the difference between context overflow and a clean run. The default n=5 is aggressive but works; the agent's own thoughts + recent state command output gives it enough to keep going.

image_observations does the multimodal equivalent (converts image bytes into vision content blocks).

6. 🐳 The runtime — SWE-ReX

SWE-agent v1.0+ delegates all sandboxing to a separate package, SWE-ReX. The README tagline:

"SWE-ReX is a runtime interface for interacting with sandboxed shell environments, allowing you to effortlessly let your AI agent run any command on any environment."

What SWE-ReX provides

A uniform Runtime API:

create_session(name) — spawn a persistent shell. The agent can have multiple — one for the main repo, one for gdb, one for nc.
run_in_session(session, command, timeout) — send command, capture stdout, return (output, exit_code). Detects completion via shell sentinels, not timeouts.
read_file(path) / write_file(path, content) — direct FS access, separate from the shell.
is_alive() — health check.

Backends, all behind the same API:

Local — just subprocess. Fast for development.
Docker — for SWE-Bench reproduction.
Modal — pip install swe-rex[modal]. Serverless containers; 1000+ in parallel.
AWS Fargate — pip install swe-rex[fargate]. Long-running cloud sandboxes.
Daytona — work in progress.

Why it was spun out

Three reasons spelled out in the SWE-ReX README:

Massive parallelism for benchmarking. Running SWE-Bench Verified (500 instances) sequentially takes days. Running it on 50 parallel Modal sandboxes takes ~1 hour. Wiring Docker into the agent loop directly made parallelism brittle.
Non-Linux dev experience. A Mac developer without Docker should still be able to run the agent locally. SWE-ReX's Local backend just shells out — no container needed.
Cleaner agent code. await env.communicate(action, timeout=10) is the entire API surface the agent sees. No Modal IDs, no Fargate task ARNs, no Docker exec leaking into the agent loop.

Sentinel-based completion detection

A subtle but important detail: when you send a command to a persistent shell, you can't tell when it's "done" without (a) waiting forever or (b) timing out. SWE-ReX appends a unique sentinel string to every command and watches stdout for it:

my_command; echo "###SWE-REX-COMPLETE-<unique-id>###"

When the sentinel appears, the shell is idle again. This is what makes interactive subprocesses — gdb, ipython, nc — work as agent tools. The agent can gdb_break main, gdb_run, gdb_step across multiple turns and the gdb session stays alive between them.

Build-your-own runtime — the minimal spec

If you don't want to depend on SWE-ReX, the absolute minimum is:

class Runtime:
    async def communicate(self, command: str, timeout: int = 30) -> tuple[str, int]:
        # send command to a persistent shell, wait for completion sentinel,
        # return (stdout, exit_code)
        ...
    async def write_file(self, path: str, content: str) -> None: ...
    async def read_file(self, path: str) -> str: ...

Implementations: subprocess.Popen with stdin/stdout pipes for local, docker exec for Docker, anything for cloud. The agent doesn't care.

7. 🚀 Autonomy — what makes the loop run unattended

This is the section the user asked about specifically. SWE-agent runs for hours without human input; here's exactly how.

7.1 🔚 Termination paths — the four ways the loop ends

while not step_output.done:
    step_output = self.step()

done = True is set by exactly four mechanisms, in priority order:

submit — agent emits the submit command, sandbox writes /root/model.patch, prints <<SWE_AGENT_SUBMISSION>>. check_for_submission_cmd() flips done. Exit status: submitted.
exit / exit_forfeit — agent voluntarily quits. Exit status: exit_command / exit_forfeit.
Autosubmit on error — any of the catastrophic exceptions (cost limit, context overflow, total execution time, consecutive timeouts, content policy) is caught by handle_error_with_autosubmission(), which runs git diff one last time, ships whatever partial patch exists, and exits. Exit status: exit_cost, exit_context, etc.
Format-error budget exhausted — max_requeries (default 3) consecutive parse / blocklist / bash-syntax failures trigger autosubmit. Exit status: exit_format.

The crucial design choice: every error path ends in autosubmit, not crash. SWE-Bench scores partial credit; a half-fixed bug is worth more than a thrown exception.

7.2 💰 The budgets — exact defaults

From GenericAPIModelConfig in sweagent/agent/models.py:

per_instance_cost_limit: float = 3.0      # USD — primary budget knob
total_cost_limit: float = 0.0             # 0 = unlimited
per_instance_call_limit: int = 0          # 0 = unlimited; max LLM calls per task
temperature: float = 0.0
top_p: float | None = 1.0
max_input_tokens: int | None = None
max_output_tokens: int | None = None

And from ToolConfig:

total_execution_timeout: int                 # cumulative bash time per instance
execution_timeout: int                       # per-command timeout
max_consecutive_execution_timeouts: int = 5  # raises after 5 timeouts in a row
max_observation_length: int = 100_000        # chars; truncated above this

Cost is the primary stop signal, not steps. The team explicitly chose cost over steps because step counts vary wildly across models — Claude-3-Opus uses few long steps, GPT-4-Turbo uses many short ones. A $3-per-task budget normalizes this.

When instance_cost > per_instance_cost_limit, LiteLLM raises CostLimitExceededError. SWE-agent catches it and triggers autosubmit. For the RetryAgent (multi-attempt), the unused portion of the budget is re-allocated to the next attempt, so a cheap-failed-fast attempt leaves more budget for the retry.

7.3 🪜 Error recovery — the requery ladder

Every step is wrapped in this ladder (forward_with_handling):

Exception	What happens	Counts toward `max_requeries`?
`FormatError` (parser couldn't extract action)	Re-prompt with `format_error_template`	yes
`_BlockedActionError` (action on blocklist)	Re-prompt with `action_blocked_template`	yes
`BashIncorrectSyntaxError` (`bash -n` failed)	Re-prompt with `shell_check_error_template`	yes
`ContentPolicyViolationError` (provider refused)	Resample	yes
`CommandTimeoutError`	Re-prompt with `command_cancelled_timeout_template`, increment timeout counter	no (timeout has its own counter)
`ContextWindowExceededError`	Autosubmit + exit	—
`CostLimitExceededError`	Autosubmit + exit	—
`TotalCostLimitExceededError`	Hard exit (cross-instance, no autosubmit)	—
`RetryError`, `SwerexException`, generic `RuntimeError`	Autosubmit + exit	—

Three layers of guard rail compound:

bash -n syntax check before execution. If bash -n my_cmd fails, the agent never runs the malformed command — it sees Your bash command contained syntax errors and was NOT executed. ... and gets to retry.
flake8 + auto-revert on edits. Same idea, one layer up: if the edit produces broken Python, it's never persisted.
Format-error requery. If the model's response can't be parsed as a valid action, the framework re-prompts up to max_requeries=3 times before giving up.

Each layer catches a different failure mode; together they make the agent feel "self-correcting" even though the LM is just being asked to try again with more context.

7.4 🛟 The autosubmit pattern — degraded success over hard failure

This is the single most important autonomy pattern in SWE-agent, and it's worth lifting verbatim:

def handle_error_with_autosubmission(self, exit_status: str, message: str):
    try:
        # Run submit one more time, capture whatever git diff exists
        observation = self._env.communicate("submit", timeout=10)
        submission = self.extract_submission(observation)
    except Exception:
        submission = ""    # even submit failed, ship empty
    return StepOutput(
        done=True,
        exit_status=exit_status,    # "exit_cost", "exit_context", etc.
        submission=submission,
        observation=message,
    )

Why this matters: a 30-step trajectory that hits a cost limit at step 31 has probably made some real progress. A git diff of the work-in-progress is a partial patch that may pass some tests. Throwing the trajectory away for an exception is a bug; autosubmitting is the feature.

7.5 🚫 No semantic stuck detection — and that's deliberate

SWE-agent has no code that detects "you've made the same edit 5 times in a row." The team tried such detectors and abandoned them: false positives were too high, and the cost limit naturally kicks in if the agent loops anyway. Instead, the only "stuck" mitigation is prompt-level:

The lint error template includes "DO NOT re-run the same failed edit command. Running it again will lead to the same error."
The instance template includes "If a command did not work once it will not work the second time unless you modify it!"

Both are textual hints that empirically reduce loop frequency. Neither is enforced.

This is a design lesson: don't add a guardrail unless you can show its false-positive rate is low. SWE-agent's existing guardrails (cost, syntax check, lint+revert) are all 100%-precision: they only fire when something is definitely wrong. Semantic stuck detection isn't.

8. ⚙️ The configuration system

YAML-first. The CLI is built on simple-parsing so any YAML key is also addressable as a --dotted.path flag, and --config foo.yaml --config bar.yaml merges multiple files.

8.1 Single-task invocation

sweagent run \
  --agent.model.name=claude-sonnet-4-20250514 \
  --agent.model.per_instance_cost_limit=2.00 \
  --env.repo.github_url=https://github.com/SWE-agent/test-repo \
  --problem_statement.github_url=https://github.com/SWE-agent/test-repo/issues/1

Useful entries:

Flag	What it does
`--config foo.yaml`	Merge a config file (can be passed multiple times)
`--env.repo.path=test-repo`	Use a local repo instead of GitHub
`--env.deployment.image=python:3.12`	Custom Docker image
`--env.deployment.type=modal`	Use Modal serverless sandbox
`--env.post_startup_commands='["pip install -e ."]'`	Project-specific install
`--problem_statement.path=/path/problem.md`	Markdown problem statement
`--problem_statement.text="..."`	Inline problem statement
`--actions.apply_patch_locally`	Also write the resulting patch into your working tree
`--actions.open_pr`	Auto-open a PR with the diff

8.2 Batch / SWE-Bench

sweagent run-batch \
  --config config/default.yaml \
  --agent.model.name gpt-4o \
  --agent.model.per_instance_cost_limit 2.00 \
  --instances.type swe_bench \
  --instances.subset lite \
  --instances.split dev \
  --instances.slice :3 \
  --num_workers 8

Outputs a directory of .traj files plus preds.json (compatible with the SWE-Bench evaluation harness via sb-cli).

8.3 The three top-level YAML sections

agent:                # how the agent thinks
  type: default       # or 'retry' for multi-attempt
  templates: {...}
  tools: {...}
  history_processors: [...]
  model: {...}

env:                  # where the agent runs
  repo: {...}
  deployment: {...}   # local / docker / modal / fargate
  post_startup_commands: [...]

problem_statement:    # what the agent should do
  type: github_issue  # or text, file
  github_url: "..."

8.4 The `RetryAgent` for multi-attempt

agent:
  type: retry
  retry_loop:
    cost_limit: 6.00
    max_attempts: 3
  agent_configs:
    - <full DefaultAgent config #1>
    - <full DefaultAgent config #2>     # maybe different prompt or model

Each attempt gets the unused budget from the previous attempt's failure, so retry doesn't cost 3× the single-shot version. A reviewer LLM (via tools/review_on_submit_m) can decide which attempt's patch to keep; in practice the team reports mixed results — sometimes the reviewer rejects correct patches.

9. 🔐 EnIGMA — the multi-domain extension

The same agent framework, applied to offensive cybersecurity / CTF tasks. Paper: Abramovich et al., ICML 2025. Result: 13.5% on NYU CTF (200 challenges, ~3× prior SoTA), plus SoTA on Intercode-CTF and CyBench across 390 total challenges.

EnIGMA's contributions are worth understanding because they generalize the ACI design beyond Python repo editing:

9.1 Interactive Agent Tools (IAT)

SWE-agent's bash tools are one-shot: run, capture output, return. EnIGMA adds stateful subprocess tools — gdb, nc, ipython — where state spans many turns. The agent does:

gdb_start ./binary
gdb_break main
gdb_run
gdb_print $rdi
gdb_step
...

Each call is a separate agent action; the gdb session persists in a dedicated SWE-ReX shell session. This generalizes the windowed file viewer (which already maintained state across turns) to arbitrary REPL-style tools.

9.2 Three new tool families

Debugger interface — gdb wrapped as agent commands: breakpoints, register inspection, memory dumps, single-stepping.
Decompiler interface — wraps ghidra / radare2 so the agent can pull pseudocode for a binary.
Server interaction tools — connect_start <host> <port>, connect_sendline <data>, connect_recv — exposes a netcat-like session for talking to CTF challenge servers.

9.3 Soliloquizing — a failure mode worth naming

The EnIGMA paper formally identifies a failure that SWE-agent had been quietly hitting too: the model hallucinates observations rather than waiting for real tool output, especially for slow / interactive tools. The model writes "I'll run gdb" and then writes its own pretend gdb output as if it had executed.

The mitigation is part-prompt, part-runtime: the prompt explicitly tells the model "wait for the tool's actual response," and the runtime enforces real round-trips through SWE-ReX (which gives the agent no way to "see" the response without actually executing). The IAT abstraction explicitly guards against this — the agent literally cannot proceed without a real reply from the gdb session.

This generalizes: any agent that supports interactive tools needs to consider this failure mode. The fix is structural (the agent loop alternates strictly between LM call and tool execution, with no way for the model to skip a tool call) rather than detection-based.

10. 🏗️ Build-your-own — the 200-line minimum viable agent

If you wanted to ship a SWE-agent-shaped tool in a weekend, here's the minimum. This is roughly what the companion mini-swe-agent project (~100 lines, hits 65% on SWE-Bench Verified) demonstrates: most of the lift comes from the prompt + ACI tools, not the framework.

10.1 The agent loop

import subprocess, json
import litellm

SYSTEM = "You are an autonomous programmer working in a sandbox."
INSTANCE = """I have a repo at {workdir}. Fix this issue:
<issue>
{problem_statement}
</issue>

You have these commands: open <file>, goto <line>, scroll_up, scroll_down,
edit <start>:<end>\\n<replacement>\\nend_of_edit, search_dir <term>, search_file <term>,
submit, plus any bash. Format your response as: thought, then a single ```

bash ...

 ``` block.
"""

class Agent:
    def __init__(self, model, workdir, problem_statement, cost_limit=3.0):
        self.model = model
        self.history = [
            {"role": "system", "content": SYSTEM},
            {"role": "user",   "content": INSTANCE.format(workdir=workdir, problem_statement=problem_statement)},
        ]
        self.cost = 0.0
        self.cost_limit = cost_limit
        self.workdir = workdir
        self.done = False
        self.submission = None

    def run(self):
        while not self.done:
            try:
                self.step()
            except CostLimitExceeded:
                self.autosubmit("exit_cost"); break
            except Exception as e:
                self.autosubmit(f"exit_error: {e}"); break
        return self.submission

    def step(self):
        if self.cost > self.cost_limit:
            raise CostLimitExceeded()
        resp = litellm.completion(model=self.model, messages=self.history)
        self.cost += resp._hidden_params["response_cost"]
        text = resp.choices[0].message.content
        thought, action = parse_thought_action(text)         # extract last ```
{% endraw %}
...
{% raw %}
``` block
        if not action:
            self.history.append({"role": "user", "content": "Please respond with a thought + ``...`` block."})
            return
        # syntax check
        if subprocess.run(["bash", "-n", "-c", action], capture_output=True).returncode != 0:
            self.history.append({"role": "user", "content": "Bash syntax error. Try again."})
            return
        # execute
        result = subprocess.run(["bash", "-c", action], capture_output=True, text=True, cwd=self.workdir, timeout=60)
        observation = result.stdout + result.stderr
        if "<<SWE_AGENT_SUBMISSION>>" in observation:
            self.done = True
            self.submission = self.run_git_diff()
            return
        # state command
        state = self.run_state()
        observation = f"{observation}\n[Open file: {state['open_file']} | cwd: {state['workdir']}]"
        # truncate long observations
        if len(observation) > 100_000:
            observation = observation[:50_000] + "\n... [truncated] ...\n" + observation[-50_000:]
        self.history.append({"role": "assistant", "content": text})
        self.history.append({"role": "user",      "content": f"OBSERVATION:\n{observation}"})

    def autosubmit(self, exit_status):
        self.submission = self.run_git_diff()
        self.done = True

10.2 The four ACI tools as bash scripts

Each one ~30 lines. The windowed viewer needs a state file (/tmp/agent_state.json) for cursor position; the linter needs flake8.

# bin/open
file="$1"; line="${2:-1}"
echo "{\"open_file\": \"$file\", \"first_line\": $line}" > /tmp/agent_state.json
WINDOW=100; total=$(wc -l < "$file")
end=$((line + WINDOW)); above=$((line - 1)); below=$((total - end))
echo "[File: $file ($total lines total)]"
[ $above -gt 0 ] && echo "($above more lines above)"
sed -n "${line},${end}p" "$file" | nl -ba -s: -w1
[ $below -gt 0 ] && echo "($below more lines below)"

# bin/edit  (paraphrased)
file=$(jq -r .open_file /tmp/agent_state.json)
range="$1"; start="${range%%:*}"; end="${range##*:}"
shift; replacement=$(cat)   # rest of stdin
cp "$file" "$file.bak"
{ head -n $((start-1)) "$file.bak"; echo "$replacement"; tail -n +$((end+1)) "$file.bak"; } > "$file"
# lint
new_errors=$(flake8 --select=F821,F822,E999 "$file")
old_errors=$(flake8 --select=F821,F822,E999 "$file.bak")
diff_errors=$(diff <(echo "$old_errors") <(echo "$new_errors") | grep '^>')
if [ -n "$diff_errors" ]; then
    cp "$file.bak" "$file"   # revert
    echo "Edit rejected — flake8 errors:"; echo "$diff_errors"
    echo "Edit not applied. DO NOT re-run the same failed edit."
    exit 1
fi
echo "Edit applied to $file."

# bin/submit
git -C "$1" add -A
git -C "$1" diff --cached
echo "<<SWE_AGENT_SUBMISSION>>"

10.3 What you've reproduced

That ~250 lines (loop + 4 tools) is enough to:

Run an LM agent against a Python repo with bounded windowed reads, line-targeted lint-checked edits, and bash fallback.
Auto-submit on cost overrun or any exception.
Track cost and stop at a budget.
Refuse to execute syntactically broken bash.

It won't hit 65% on SWE-Bench Verified — that takes the prompt tuning, the multi-tool history processors, and a frontier model. But it will autonomously run for hours unattended without crashing, which is the core of the value.

10.4 The high-leverage upgrades

If you have one extra week, these are the things to add in priority order:

Persistent shell session. Replace subprocess.run with a Popen-based persistent bash that survives across actions. Many real workflows depend on cd, source venv/bin/activate, etc.
last_n_observations history processor. Drop stale observations from the prompt, keep thoughts and actions. This is the difference between a 50-turn run and a 10-turn run.
Function-calling parser for Anthropic / OpenAI. Replaces backtick parsing for stronger structure and cheaper cache hits.
Trajectory recording. Dump JSON after every step. Worth it just for debugging.
Docker backend. Wrap the runtime in a Runtime interface, ship a Docker implementation. This unlocks running multiple instances in parallel.
A reviewer LLM on submit. Inject a "are you sure" critique before accepting submission. Mixed empirical results, but easy to add.

11. 📊 Numbers — what SWE-agent actually scores

Metric	Result	Model
SWE-Bench (Full) v0.7 paper	12.29% pass@1	GPT-4
SWE-Bench Lite v0.7 paper	18.0%	GPT-4
HumanEvalFix v0.7 paper	87.7%	GPT-4
SWE-Bench Verified v1.0+	SoTA at release (~62-66%)	Claude 3.7 Sonnet
SWE-Bench (Full) v1.0.1	SoTA at release	Claude 3.7 Sonnet
Mini-SWE-Agent (100 LOC)	65%	Claude 3.7 Sonnet
NYU CTF (EnIGMA)	13.5%	GPT-4
Mean cost per SWE-Bench instance	~$2 (under $4 cap)	GPT-4

The most important number for the thesis: the ablation against the bash-only config drops the score from ~12% to ~3%. Without ACI, GPT-4 + raw bash is no better than the original RAG baseline. The interface really is the model.

Repo activity: ~19.1k stars, ~2.1k forks, actively maintained, MIT licensed.

12. ⚠️ Lessons & limitations the team has stated

From the paper, the v1.0 release notes, and the docs:

Long files break naive cat. First version let the model cat whole files — context blew up within 2-3 turns. Switching to a windowed viewer was the single biggest delta.
sed-based edits fail. Letting the model edit by sed -i caused frequent syntactic collapses (escape handling, especially). The line-range edit + end_of_edit + flake8 + autorollback block fixed this.
Verbose tool output destroys reasoning. Capping search_dir results, suppressing pip progress bars (PIP_PROGRESS_BAR: off, TQDM_DISABLE: 1, PAGER: cat), and turning off interactive prompts were necessary, not optional. These env vars are baked into every default config.
Demonstrations don't generalize across model families. GPT-4-tuned demos hurt Claude. v1.0 default drops demos entirely; the system + instance template carries the load.
Soliloquizing (EnIGMA paper): models hallucinate fake stdout when the real tool is too slow/interactive. Mitigation: enforce real round-trips through the runtime; never let the model "skip" a tool call.
Step-budget vs. cost-budget. v1.0 explicitly chose cost-budget over step-budget because step counts vary 5× across model families. Default per_instance_cost_limit=$3 is the operative ceiling.
The reviewer-on-submit experiment was mixed. Sometimes the reviewer rejects correct patches. It's opt-in, not on by default.
Multi-agent flows didn't pay off. The RetryAgent + reviewer was the closest the team got to multi-agent; sub-agent delegation was tried and abandoned for SWE tasks (the EnIGMA team had better luck because CTF challenges decompose more cleanly).
What was abandoned: an in-context filesystem tree dump (too noisy; replaced by filemap for Python and find_file for navigation); a "diff against base" tool (collapsed into submit); semantic stuck detection (false positives too high).

13. 🎯 The TL;DR — design rules to copy

If you take only the build-rules and skip the architecture:

Build agent-shaped tools, not human-shaped tools. Bounded output, persistent state in the runtime, fixed argument grammar, side-by-side before/after error messages.
Validate destructive actions before they land. bash -n for shell, flake8 for Python, with diff-based filtering so the agent isn't blamed for pre-existing issues.
Make every termination path produce a degraded success. Cost limit, context overflow, total timeout, consecutive-error count — all of them autosubmit, none of them crash.
Use cost as the budget, not steps. Step counts vary too much across models.
Define tools as YAML manifests + bash scripts, not Python plugins. Universal portability across runtimes; trivial install path; tool docs render directly into the prompt.
Use a sentinel pattern for done detection. A unique string in the observation (<<SWE_AGENT_SUBMISSION>>) means any tool can declare completion without a separate channel.
Run history through processors before every model call. cache_control for prompt caching, last_n_observations for context bound. Both are cheap and indispensable.
Separate the runtime from the agent. SWE-ReX exists because hard-coding Docker into the agent loop made parallelism brittle and the dev experience bad. A Runtime abstraction with Local/Docker/Modal/Fargate backends lets you scale up later.
Don't add semantic guardrails unless they're 100% precision. Prompt-level hints ("DO NOT re-run the same failed command") work better than detectors when the false-positive cost is high.
Optimize for the prompt + tools, not the framework. Mini-SWE-Agent gets 65% on SWE-Bench Verified in 100 lines. The framework's job is to be invisible; the prompt and the tools are doing the work.

📚 Sources

SWE-agent GitHub repository — main repo
SWE-agent docs — usage tutorials, ACI background
SWE-agent paper (arXiv 2405.15793) — Yang, Jimenez, Wettig, Lieret, Yao, Narasimhan, Press (NeurIPS 2024)
EnIGMA paper (arXiv 2409.16165) — Abramovich et al. (ICML 2025)
SWE-ReX runtime
Mini-SWE-Agent — 100-LOC reference implementation
SWE-Bench leaderboard
Source files referenced inline:
- sweagent/agent/agents.py — main loop
- sweagent/tools/parsing.py — action parsers
- sweagent/tools/tools.py — tool dispatch
- tools/windowed/ — file viewer bundle
- tools/windowed_edit_linting/ — edit + lint bundle
- tools/search/ — search bundle
- tools/submit/ — submit bundle
- config/default.yaml — modern default config
- config/sweagent_0_7/ — classic ReAct config

If you found this helpful, let me know by leaving a 👍 or a comment!, or if you think this post could help someone, feel free to share it! Thank you very much! 😃

Table of Contents