<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Refact AI</title>
    <description>The latest articles on DEV Community by Refact AI (@refact).</description>
    <link>https://dev.to/refact</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7379%2F8e1d0bee-5d31-4adb-89b6-db7380552e15.jpg</url>
      <title>DEV Community: Refact AI</title>
      <link>https://dev.to/refact</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/refact"/>
    <language>en</language>
    <item>
      <title>Long-term memory in AI programming: Why your team needs an Agent that doesn't forget</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Fri, 11 Jul 2025 00:09:57 +0000</pubDate>
      <link>https://dev.to/refact/long-term-memory-in-ai-programming-why-your-team-needs-an-agent-that-doesnt-forget-4636</link>
      <guid>https://dev.to/refact/long-term-memory-in-ai-programming-why-your-team-needs-an-agent-that-doesnt-forget-4636</guid>
      <description>&lt;p&gt;Ever feel like you’re stuck in a developer’s version of &lt;em&gt;Groundhog Day&lt;/em&gt;?&lt;br&gt;&lt;br&gt;
You explain your code and project context to an AI, it helps for a while, but next session it’s like you’re meeting a stranger — all that context is gone.&lt;/p&gt;

&lt;p&gt;You’ve probably seen this outside of coding too. You tell the AI you have two cats. Then later, while asking about pet food, it goes: “So, do you have dogs?” only to randomly mention your cats in a completely unrelated conversation about databases.&lt;/p&gt;

&lt;p&gt;It’s frustrating. You want continuity in programming with AI, but it can’t even handle a task nearly identical to the one it solved yesterday. At some point, you start wondering why you bother at all. Maybe that surfing career is still an option… Sound familiar? Hold that thought.&lt;/p&gt;

&lt;p&gt;How can we make AI remember and learn over time? The solution exists: &lt;strong&gt;continuous, long-term memory in AI Agents&lt;/strong&gt;, and thankfully, today’s AI Agents can already use it in real workflows.&lt;/p&gt;

&lt;p&gt;This article isn’t about the textbook definitions of memory types. Instead, we’ll explore the practical application of memory in modern agentic workflows, why knowledge management in AI programming is crucial, and how you can &lt;a href="https://refact.ai/blog/2025/long-term-memory-in-ai-programming-why-your-team-needs-an-agent-that-doesnt-forget/" rel="noopener noreferrer"&gt;get an AI Agent with memory&lt;/a&gt; for you and your team.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is memory in AI programming tools?
&lt;/h2&gt;

&lt;p&gt;When we talk about “memory” in AI tools, it’s useful to distinguish between &lt;em&gt;short-term&lt;/em&gt; and &lt;em&gt;long-term&lt;/em&gt; memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term memory&lt;/strong&gt; – the information an AI model can hold within a single session (bounded by the model’s context window). Once you exceed it or start a new session, the older context falls out of scope.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; – information that persists across sessions and beyond the context window. True long-term memory in an AI coding Agent means it can retain and recall important facts and patterns from yesterday, last week, or last month, and apply them to the task at hand.&lt;/li&gt;
&lt;/ul&gt;
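&lt;p&gt;To make the distinction concrete, here’s a minimal sketch (ours, not any particular product’s API) of a long-term store: facts written in one session survive into the next because they live on disk rather than in the context window. The class name, file name, and keyword-matching retrieval are all illustrative.&lt;/p&gt;

```python
import json
from pathlib import Path

class MemoryStore:
    """Toy long-term memory: facts persist on disk across sessions."""

    def __init__(self, path="agent_memory.json"):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, fact):
        self.facts.append(fact)
        self.path.write_text(json.dumps(self.facts))

    def recall(self, query):
        # Naive keyword overlap stands in for real semantic retrieval.
        words = set(query.lower().split())
        return [f for f in self.facts if words.intersection(f.lower().split())]

# Session 1: store a fact.
MemoryStore().remember("User's preferred database is PostgreSQL")

# Session 2 (a fresh process sees the same file): recall it.
print(MemoryStore().recall("which database should I use"))
```

&lt;p&gt;Real systems replace the keyword match with embedding search, but the key property is the same: the store outlives any single conversation.&lt;/p&gt;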

&lt;p&gt;In essence, short-term memory gives an AI continuity within a single conversation, but long-term memory gives it continuity across conversations (e.g., within a shared workspace).&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a long-term memory AI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An example of AI with long-term memory is an AI Agent&lt;/strong&gt; that, instead of wiping context after each interaction, retains critical information and reuses it to improve future responses. Thus, it might remember your key API endpoints, the fact that you’re migrating from one framework to another, the approach to some tasks, etc.&lt;/p&gt;

&lt;p&gt;Crucially, long-term memory in AI isn’t about storing everything; it’s about capturing the right facts that will improve the AI’s accuracy and user experience later on. A well-designed memory system might log important details (e.g., “&lt;em&gt;User’s preferred database is PostgreSQL&lt;/em&gt;”) and surface them when relevant, while ignoring irrelevant one-off prompts.&lt;/p&gt;

&lt;p&gt;By persisting such context, an AI programming Agent transitions from a stateless tool into a learning collaborator. It starts to “know” your project and your team.&lt;/p&gt;




&lt;h2&gt;
  
  
  Long-term memory AI — simple explanation
&lt;/h2&gt;

&lt;p&gt;A well-known demonstration comes from the paper &lt;a href="https://arxiv.org/abs/2304.03442" rel="noopener noreferrer"&gt;“Generative Agents: Interactive Simulacra of Human Behavior” (ACM UIST 2023).&lt;/a&gt; Its core idea is a &lt;strong&gt;memory stream&lt;/strong&gt; — a natural-language log of everything the Agent observes, does, and thinks.&lt;/p&gt;

&lt;p&gt;The behavior cycle is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perceive:&lt;/strong&gt; write each new observation to the memory stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve:&lt;/strong&gt; pick the memories most useful right now, based on recency (time decay), importance (LM-rated 1-10), and relevance (semantic similarity to the current context).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflect:&lt;/strong&gt; cluster recent entries into broader takeaways (“I enjoy helping people”) that are saved back to memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan:&lt;/strong&gt; turn goals and reflections into an ordered set of actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act:&lt;/strong&gt; execute the next step, record the result back to memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fact that your AI Agent doesn’t remember is a technical limitation of the context window: the model isn’t actively maintaining a knowledge base of what happened before. In practice, that means the project structure, code details, and requirements you fed it earlier may be absent today.&lt;/p&gt;

&lt;p&gt;More crucially, the AI also forgets the good decisions it made. The fact that it wrote a brilliant piece of code or fixed an old bug yesterday doesn’t mean it will know how to do the same tomorrow. Maintaining conversation memory with an AI Agent, or preserving its context and understanding across sessions, is impossible without special handling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In modern development, if an AI Agent can’t accumulate experience, learn from mistakes, or follow a development narrative over time, it may be less useful in long-living coding projects and large codebases.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why developers need Agent workflow memory
&lt;/h2&gt;

&lt;p&gt;Lack of long-term memory actively hinders productivity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every session starts from zero.&lt;/strong&gt; Yesterday’s complex deployment explanation? Gone.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No project context.&lt;/strong&gt; Custom utilities, deprecated patterns — forgotten.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No learning across time.&lt;/strong&gt; Good answers aren’t reused; the Agent never gets smarter.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No team-wide learning.&lt;/strong&gt; A bug fix found by one Agent isn’t shared with others.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared access to data/docs.&lt;/strong&gt; Each dev’s context stays local.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge leaves with people.&lt;/strong&gt; Departing teammates take their AI experience with them.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor onboarding.&lt;/strong&gt; New hires start from scratch; the Agent can’t surface prior answers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given these limitations, it’s clear that enabling long-term memory in AI is the missing piece to make AI Agents truly effective.&lt;/p&gt;




&lt;h2&gt;
  
  
  Team knowledge in AI Agent: The future of software development
&lt;/h2&gt;

&lt;p&gt;Imagine an Agent for IDE that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Records its moves,&lt;/li&gt;
&lt;li&gt;Notes which approaches succeed or fail,&lt;/li&gt;
&lt;li&gt;Stores those as memory items,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, when faced with a similar task, &lt;strong&gt;retrieves relevant past attempts&lt;/strong&gt; and solves the new problem with full context &lt;em&gt;plus&lt;/em&gt; experience.&lt;/p&gt;

&lt;p&gt;Extend that to a team: each Agent’s personal memory stream merges into a &lt;strong&gt;shared knowledge base&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The whole team gains automatic access to the best solutions, patterns, and gotchas discovered by anyone’s AI Agent on the project. A bug fix found by one Agent becomes instantly available to every other team member’s Agent. This collective memory becomes a form of AI-driven knowledge transfer across the team — a self-updating wiki of coding wisdom for the project at hand.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI knowledge management platform for long-term memory
&lt;/h2&gt;

&lt;p&gt;The most efficient organization is a &lt;strong&gt;cloud workspace&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admins manage memory items.
&lt;/li&gt;
&lt;li&gt;Agents connect to shared databases, docs, APIs.
&lt;/li&gt;
&lt;li&gt;Each Agent is both &lt;em&gt;contributor&lt;/em&gt; and &lt;em&gt;consumer&lt;/em&gt; of the knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By coordinating memory at the team level, you ensure AI output consistency. And the payoff is huge: fewer duplicated efforts, fewer recurring bugs, and onboarded developers get up to speed faster with the help of an AI that actually knows the project.&lt;/p&gt;

&lt;p&gt;How many times have you wished, “&lt;em&gt;Didn’t we solve a similar problem last month?&lt;/em&gt;” Now the AI can instantly answer that and even apply the previous solution. Ultimately, when AI Agents gain long-term memory and shared team-wide data access, they stop repeating mistakes and start accelerating progress. Developers don’t just get a helpful AI digital twin: they get a system that remembers, learns, and collaborates at scale.&lt;/p&gt;

&lt;p&gt;To sum up: &lt;strong&gt;integrating long-term memory and shared data access at the team level becomes an anti-frustration pill for AI Agents&lt;/strong&gt;, letting them pull in relevant project knowledge whenever developers need it while programming with AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Programming &lt;em&gt;without&lt;/em&gt; memory vs. &lt;em&gt;with&lt;/em&gt; memory
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;❌ &lt;em&gt;Without a memory layer&lt;/em&gt;
&lt;/th&gt;
&lt;th&gt;✅ &lt;em&gt;With memory&lt;/em&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires repeated explanations&lt;/td&gt;
&lt;td&gt;AI Agent remembers past conversations and task states; resumes with full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access to shared project resources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each dev manually re-uploads docs or snippets; context stays local&lt;/td&gt;
&lt;td&gt;Team Agents query shared DBs, docs, and APIs via a connected memory layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No knowledge sharing; best practices are isolated&lt;/td&gt;
&lt;td&gt;Agents share successful approaches automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge reuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent re-learns similar tasks from scratch&lt;/td&gt;
&lt;td&gt;Proven solutions are reused and instantly applied across sessions &amp;amp; devs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code quality over time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inconsistent; depends on each dev’s prompting skills&lt;/td&gt;
&lt;td&gt;Agents generate code aligned with project standards and improve with use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Onboarding new members&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New devs start from zero; must learn project &amp;amp; AI practices&lt;/td&gt;
&lt;td&gt;Agents preload project knowledge &amp;amp; conventions; useful from day 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge loss when teammates leave&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experience with AI programming is lost&lt;/td&gt;
&lt;td&gt;Persistent memory retains learnings for future teammates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to maintain AI context and understanding across programming sessions, you need an AI Agent &lt;strong&gt;with memory&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implement memory in your AI Agent for programming
&lt;/h2&gt;

&lt;p&gt;Forgetting is &lt;em&gt;not&lt;/em&gt; a given. When choosing an AI coding tool, ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does it forget everything when you close the window, or does it learn and improve with you?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single difference defines whether it’s the best AI Agent for software development or just another tool.&lt;/p&gt;

&lt;p&gt;The era of long-term memory in AI Agents is just beginning. These Agents remember your codebase, follow your standards, and re-apply successful solutions across tasks and teammates.&lt;/p&gt;

&lt;p&gt;You can be among the first to adopt an AI Agent with memory. It runs inside your IDE with no complex setup, but the impact is real.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want an AI Agent with memory for a team of 3+ developers? &lt;a href="https://refact.ai/waitlist/" rel="noopener noreferrer"&gt;Fill out the form to join the Waitlist&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>SWE-bench Multimodal is the benchmark that JavaScript devs might explore</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Thu, 26 Jun 2025 22:26:20 +0000</pubDate>
      <link>https://dev.to/refact/swe-bench-multimodal-is-the-benchmark-that-javascript-devs-might-explore-3550</link>
      <guid>https://dev.to/refact/swe-bench-multimodal-is-the-benchmark-that-javascript-devs-might-explore-3550</guid>
      <description>&lt;p&gt;We recently ran Refact.ai Agent on &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench Multimodal&lt;/a&gt;, &lt;strong&gt;a benchmark that honestly doesn’t get enough attention&lt;/strong&gt;. It’s one of the few evaluations that test if AI can fix bugs described using screenshots (e.g., UI mockups, diagrams, error messages, etc.).&lt;/p&gt;

&lt;p&gt;Unlike SWE-bench Verified (Python-only), the Multimodal version focuses on web libraries and frontend tasks. That makes it more representative of real-world debugging, especially in JavaScript environments where bugs are often reported this way.&lt;/p&gt;

&lt;p&gt;So, I'm here to share that our &lt;strong&gt;AI Agent Refact.ai has achieved #1 on SWE-bench Multimodal&lt;/strong&gt;. It solved 184 out of 517 tasks (35.59%) and did it fully autonomously. &lt;strong&gt;We also scored highest on SWE-bench Verified&lt;/strong&gt; among AI Agents solving tasks in pass@1 (in one attempt).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v2j83vaevsccddze0lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v2j83vaevsccddze0lg.png" alt="Refact.ai is the leading AI Agent for programming on SWE Bench" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://github.com/smallcloudai/refact-bench" rel="noopener noreferrer"&gt;SWE-bench pipeline we used is open-source&lt;/a&gt; and fully reproducible. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can run Refact.ai in &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt;, &lt;a href="https://plugins.jetbrains.com/plugin/20647-refact--open-source-ai-agent-code-generator--chat" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt;, or self-host: it can fix the toughest bugs, solve routine dev tasks you delegate, build working solutions from scratch, and help you do more with less manual coding!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I’ll walk through how we achieved top results on SWE-bench and the tech behind the runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  #1 AI Agent in SWE-bench Multimodal is Refact.ai
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.swebench.com/multimodal.html" rel="noopener noreferrer"&gt;SWE-bench Multimodal&lt;/a&gt; tests whether an AI Agent can handle GitHub issues that include both text and visuals, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screenshots of bugs or interface issues
&lt;/li&gt;
&lt;li&gt;Design mockups or wireframes
&lt;/li&gt;
&lt;li&gt;Diagrams explaining desired functionality
&lt;/li&gt;
&lt;li&gt;Error messages with visual context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It covers tasks from libraries used in web interfaces, diagramming, data visualization, syntax highlighting, and more.&lt;/p&gt;

&lt;p&gt;We ran this benchmark &lt;strong&gt;fully autonomously&lt;/strong&gt; using a locally modified version of the official &lt;code&gt;sb-cli&lt;/code&gt; to enforce single-threaded execution.  &lt;/p&gt;

&lt;p&gt;Also, we didn’t use extra agentic tools like &lt;code&gt;debug_script&lt;/code&gt; or &lt;code&gt;strategic_planning&lt;/code&gt;, which were part of our earlier Verified runs.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Evaluation results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Solved (%)&lt;/th&gt;
&lt;th&gt;Not solved&lt;/th&gt;
&lt;th&gt;Failed runs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;517&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;td&gt;35.59%&lt;/td&gt;
&lt;td&gt;326&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;Achieving &lt;strong&gt;#1 on SWE-bench Multimodal&lt;/strong&gt; makes &lt;a href="https://refact.ai/" rel="noopener noreferrer"&gt;Refact.ai a top-tier AI Agent for JavaScript tasks&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Combined with our leading results on Python-based SWE-bench, it confirms the Agent’s ability to deliver &lt;strong&gt;high-quality results across programming languages&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refact.ai’s open-source approach to SWE-bench
&lt;/h2&gt;

&lt;p&gt;The new run introduced a key upgrade: &lt;strong&gt;Anthropic’s Claude 4 Sonnet&lt;/strong&gt; as the core model, bringing a notable boost in reasoning and code generation. With it, Refact.ai Agent reached a &lt;strong&gt;74.40% score&lt;/strong&gt; — surpassing &lt;a href="https://refact.ai/blog/2025/open-source-sota-on-swe-bench-verified-refact-ai" rel="noopener noreferrer"&gt;Refact.ai's best SWE-bench Verified score of 70.4%&lt;/a&gt; with Claude 3.7 Sonnet.&lt;/p&gt;

&lt;p&gt;Beyond that, this milestone builds on everything we’ve learned from earlier SWE-bench runs.&lt;/p&gt;

&lt;p&gt;Our approach remains focused on &lt;strong&gt;reliability&lt;/strong&gt; and &lt;strong&gt;step-by-step problem solving&lt;/strong&gt;. Key elements of the SWE-bench Verified setup included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/smallcloudai/refact/blob/swe-verified-claude4/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;Open-source Agent prompt, available on GitHub&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude 4 Sonnet&lt;/strong&gt; as a core model
&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;debug_script()&lt;/code&gt; sub-agent that fixes bugs and can modify/create new scripts
&lt;/li&gt;
&lt;li&gt;Extensive guardrails to catch when the model is stuck or going off track, and to redirect it back on course
&lt;/li&gt;
&lt;li&gt;Incremental improvements built on our previous Claude 3.7 run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does Refact.ai Agent solve the SWE-bench Verified tasks? It follows a &lt;strong&gt;four-step strategy&lt;/strong&gt; defined &lt;a href="https://github.com/smallcloudai/refact/blob/swe-verified-claude4/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;in its system prompt&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The Agent starts by &lt;strong&gt;exploring the problem&lt;/strong&gt;: it uses tools like &lt;code&gt;cat()&lt;/code&gt; to open files, and &lt;code&gt;search_symbol_definition()&lt;/code&gt;, &lt;code&gt;search_pattern()&lt;/code&gt;, etc. to locate relevant code. It also uses &lt;code&gt;compress_session()&lt;/code&gt; to keep context focused, ensuring it gathers the right information before attempting any changes.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;step two&lt;/strong&gt;, the Agent &lt;strong&gt;reproduces the issue&lt;/strong&gt;. It runs all existing tests to ensure a clean baseline, writes a script that triggers the bug (covering all possible edge cases), sets up the environment, and runs the script via &lt;code&gt;shell("python ...")&lt;/code&gt; to confirm the failure. Then &lt;code&gt;debug_script()&lt;/code&gt; takes over — a custom sub-agent that uses &lt;code&gt;pdb&lt;/code&gt; to debug, modify, and generate scripts. Powered by Claude 4 with &lt;code&gt;o4-mini&lt;/code&gt; for summarizing the debug info, it’s called at least once — and up to three times — per task. In practice, it was really helpful for digging into the problem source.&lt;/p&gt;

&lt;p&gt;Once complete, the Agent &lt;strong&gt;plans and applies the fix&lt;/strong&gt; based on the debugging report. It updates project files directly, without creating patches and diffs. In the earlier run, this step used a separate &lt;code&gt;strategic_planning()&lt;/code&gt; tool. With Claude 4 Sonnet, that’s no longer needed — the model’s reasoning is strong enough to handle this job on its own. Finally, the Agent &lt;strong&gt;checks its work&lt;/strong&gt;: re-runs the reproduction script and the project’s existing tests to validate the fix. If all tests pass, it uses &lt;code&gt;compress_session()&lt;/code&gt; to offload any debug or temporary files and optimize context usage before ending the run.&lt;/p&gt;

&lt;p&gt;Throughout the run, &lt;strong&gt;automatic guardrails&lt;/strong&gt; help keep the Agent on track. These are mid-run messages, inserted into the chat as if from a simulated “user” when the model gets stuck or makes mistakes. A script monitors Claude 4’s outputs with static rules and, when needed, injects messages to guide the model back on course. For example, it may remind the model to open all visited files after &lt;code&gt;debug_script()&lt;/code&gt;, or to follow correct implementation rules after planning. These small actions make a big difference in stability.&lt;/p&gt;
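&lt;p&gt;As a rough sketch, such a guardrail monitor can be a set of static rules over the Agent’s latest output that decide whether to inject a corrective “user” message. The rules and messages below are illustrative, not Refact.ai’s actual implementation.&lt;/p&gt;

```python
# Illustrative guardrail monitor: static rules over the agent's latest output
# decide whether to inject a corrective "user" message into the chat.
GUARDRAILS = [
    (lambda text: "debug_script()" in text and "cat(" not in text,
     "Remember to open all files visited by debug_script() before editing."),
    (lambda text: text.count("search_pattern(") > 3,
     "You seem stuck in search. Re-read the issue and state a concrete plan."),
]

def check_guardrails(agent_output):
    """Return the corrective messages to inject mid-run, or an empty list."""
    return [msg for cond, msg in GUARDRAILS if cond(agent_output)]

transcript = "called debug_script() on repro.py, report attached"
for msg in check_guardrails(transcript):
    print("inject as user:", msg)
```

&lt;p&gt;Because the rules are plain checks rather than another model call, they are cheap to run after every Agent turn.&lt;/p&gt;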

&lt;blockquote&gt;
&lt;p&gt;The entire run is fully autonomous: no manual inputs, no retries. Each task runs in a single session, with the Agent self-correcting and managing context to stay efficient and produce a single correct solution.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  SWE-bench Verified vol.2: What changed in the new run
&lt;/h2&gt;

&lt;p&gt;Several upgrades helped push Refact.ai Agent from 70.4% to 74.4% on SWE-bench Verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model upgrade to Claude 4 Sonnet&lt;/strong&gt;: Replaced Claude 3.7 with the more advanced Claude 4 Sonnet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removed &lt;code&gt;strategic_planning()&lt;/code&gt;&lt;/strong&gt;: Previously, this tool (powered by &lt;code&gt;o3&lt;/code&gt;) reasoned over &lt;code&gt;debug_script()&lt;/code&gt; output and modified files. This is now fully handled by Claude 4 Sonnet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New safeguard for file overload&lt;/strong&gt;: The Agent used to open entire folders with &lt;code&gt;cat&lt;/code&gt;, leading to context overflow. We’ve added a limit: if a folder contains more than 5 files, the call returns an error and asks for one-by-one access: &lt;em&gt;“Too many files were requested. Please open files one by one.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra guardrail at the end of the session&lt;/strong&gt;: “Check the last time that all changes applied to the project directly and all pre-existing tests aren’t broken.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Larger context for &lt;code&gt;search_pattern()&lt;/code&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minor tweaks to &lt;code&gt;debug_script()&lt;/code&gt; prompt.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these improvements work together to make Refact.ai Agent &lt;strong&gt;more robust and efficient&lt;/strong&gt;. Moving to Claude 4 Sonnet significantly boosted reasoning ability and allowed us to simplify the agent’s loop while still solving more tasks. Meanwhile, the debug sub-agent and guardrails have been enhanced to ensure greater reliability throughout each run.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Evaluation results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Not solved&lt;/th&gt;
&lt;th&gt;Solved (%)&lt;/th&gt;
&lt;th&gt;Not solved (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;372&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;74.40%&lt;/td&gt;
&lt;td&gt;25.60%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  From benchmark to your IDE
&lt;/h2&gt;

&lt;p&gt;Ultimately, our focus isn’t only on benchmark scores — it’s on building an AI agent that truly works for real developers. The lessons learned and improvements made for SWE-bench are already finding their way into the product. That means when you use Refact.ai, you’re benefitting from the engineering approach that achieved this benchmark record.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solves tasks autonomously&lt;/strong&gt;, from start to finish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully understands your codebase&lt;/strong&gt;, not just open tabs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent by design&lt;/strong&gt; — every step is visible and reversible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrates with dev tools&lt;/strong&gt; (GitHub, Web, MCP, and more) to work across your system&lt;/li&gt;
&lt;li&gt;BYOK-friendly or self-hosted &lt;strong&gt;if you want full control&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://refact.ai/" rel="noopener noreferrer"&gt;Refact.ai Agent is an AI Agent for software engineering&lt;/a&gt; you can trust — and guide when needed. Autonomous when you want it, collaborative when you step in.&lt;/p&gt;

&lt;p&gt;If you’re ready to work with an AI that understands your environment, works across your tools, and earns your trust one task at a time — Refact.ai is ready for you. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.com/invite/9GaWEK9Btb" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt; to see what real developers are building end-to-end. &lt;br&gt;
And of course, I'd be happy to answer any of your questions and chat. Thanks for reading!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How we built the open-source SOTA AI Agent on SWE-bench Verified</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Thu, 22 May 2025 19:02:36 +0000</pubDate>
      <link>https://dev.to/refact/how-we-built-the-open-source-sota-ai-agent-on-swe-bench-verified-1gml</link>
      <guid>https://dev.to/refact/how-we-built-the-open-source-sota-ai-agent-on-swe-bench-verified-1gml</guid>
      <description>&lt;p&gt;We built the open-source SOTA AI Agent on SWE-bench Verified. Score: 69.9% — 349/500 tasks solved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/index/introducing-swe-bench-verified/" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt; is a refined version of the original SWE-bench, featuring 500 real-world GitHub issues, selected manually. It provides a more accurate and consistent way to evaluate how well AI agents can handle practical software engineering tasks.&lt;/p&gt;

&lt;p&gt;Key elements that made this possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extensive guardrails that step in when the model gets stuck or goes off track&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;debug_script()&lt;/code&gt; sub-agent that uses pdb to fix bugs and can modify/create new scripts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;strategic_planning()&lt;/code&gt; tool powered by o3 to rethink and refine fixes when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/smallcloudai/refact-bench" rel="noopener noreferrer"&gt;The full pipeline we used for SWE-bench Verified is open-source — you can check it on GitHub&lt;/a&gt;. You can implement the same components and run the benchmark just like we did — to reproduce our Agent approach and score end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read on&lt;/strong&gt; to see how &lt;a href="https://refact.ai/blog/2025/open-source-sota-on-swe-bench-verified-refact-ai/" rel="noopener noreferrer"&gt;Refact.ai became the best open-source Agent in SWE-bench Verified&lt;/a&gt;, and how the same ideas power real-world workflows in Refact.ai.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Model setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Orchestration model: Claude-3.7&lt;/li&gt;
&lt;li&gt;Debug sub-agent — &lt;code&gt;debug_script()&lt;/code&gt;: Claude-3.7 + o4-mini&lt;/li&gt;
&lt;li&gt;Planning tool — &lt;code&gt;strategic_planning()&lt;/code&gt;: o3&lt;/li&gt;
&lt;li&gt;pass@1: each task gets exactly one attempt.&lt;/li&gt;
&lt;li&gt;Temperature: 0 for every Claude model.&lt;/li&gt;
&lt;/ul&gt;
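&lt;p&gt;The setup above can be summarized as a plain configuration sketch. The key names here are ours for illustration, not refact-bench’s actual schema:&lt;/p&gt;

```python
# Illustrative run configuration mirroring the model setup above;
# key names are ours, not refact-bench's actual schema.
RUN_CONFIG = {
    "orchestration_model": "claude-3.7-sonnet",
    "sub_agents": {
        "debug_script": ["claude-3.7-sonnet", "o4-mini"],
        "strategic_planning": ["o3"],
    },
    "attempts_per_task": 1,   # pass@1: a single attempt, no retries
    "temperature": 0.0,       # deterministic sampling for every Claude model
}
print(RUN_CONFIG["attempts_per_task"])
```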

&lt;p&gt;For each SWE-bench Verified problem, &lt;a href="https://refact.ai" rel="noopener noreferrer"&gt;Refact.ai&lt;/a&gt; Agent made one multi-step run aiming to produce a &lt;strong&gt;single, correct final solution.&lt;/strong&gt; Our main goal was to achieve a maximum score in a single attempt.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Simpler, more effective Agent prompt&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We revised the Agent prompt from our &lt;a href="https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/" rel="noopener noreferrer"&gt;SWE-bench Lite run, where we top-ranked with a 59.7% score&lt;/a&gt;. Back then, the prompt was more complex; watching how the Agent behaved, we realized that simpler is better.&lt;/p&gt;

&lt;p&gt;The new version is shorter and easier to follow. Since &lt;a href="https://github.com/smallcloudai" rel="noopener noreferrer"&gt;Refact.ai is open-source&lt;/a&gt;, you can explore it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a fully autonomous agent for coding tasks.
  Your task is to identify and solve the problem from the given PR by directly changing files in the given project.
  You must follow the strategy, step by step in the given order without skipping.
  **Step 1: Explore the Problem**
    - Use `cat()` to open files. Use `search_symbol_definition()`, `search_symbol_usages()` if you know names of symbols.
    - Use `search_pattern()` for search by pattern, `search_semantic()` for a semantic search.
  **Step 2: Reproduce the Problem using `debug_script()`**
    - Find and run all project's existing tests to ensure the fix won't introduce new problems elsewhere.
    - Write a script that reproduces the issue. Cover as many corner cases as possible.
    - Set up necessary environment (e.g., create required folders or additional files) to run the script.
    - Run the script using `shell("python ...")` to verify that the error occurs and the script is correct.
    - After verifying that the script is correct and reproduces the issue, call `debug_script()` to debug it.
  **Step 3: Make a Plan using `strategic_planning()` and fix the Problem**
    - Open all new files mentioned in `debug_script()` report.
    - Call `strategic_planning()` once to think through and brainstorm the solution.
    - Update project files directly without creating patches and diffs.
  **Step 4: Check and Improve Your Work by running tests**
    - Execute the script that reproduces the original issue.
    - Run project's existing tests again to ensure the fix doesn't introduce new problems elsewhere.

  **BEST PRACTICES**
    - You must follow the strategy (explore -&amp;gt; reproduce -&amp;gt; solve -&amp;gt; check), step by step in the given order.
    - Before each step explicitly announce your next actions. Make sure they still align with the strategy.
    - Include your thoughts wrapped in &amp;lt;think&amp;gt;&amp;lt;/think&amp;gt; before any action.
    - %CD_INSTRUCTIONS%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
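&lt;p&gt;The step ordering in the prompt above can be sketched as a simple step-to-tool check. This is a minimal sketch under assumed names — the &lt;code&gt;ALLOWED_TOOLS&lt;/code&gt; mapping and &lt;code&gt;check_tool_call()&lt;/code&gt; helper are ours for illustration, not Refact.ai’s actual code:&lt;/p&gt;

```python
# Illustrative sketch of enforcing the explore -> reproduce -> solve -> check
# order; the step-to-tool mapping below is an assumption, not Refact.ai's code.
ALLOWED_TOOLS = {
    "explore": {"cat", "search_symbol_definition", "search_symbol_usages",
                "search_pattern", "search_semantic"},
    "reproduce": {"cat", "shell", "debug_script"},
    "solve": {"cat", "strategic_planning", "update_textdoc", "create_textdoc"},
    "check": {"shell"},
}

def check_tool_call(step, tool):
    """Return a guardrail message if the call violates the strategy, else None."""
    if tool not in ALLOWED_TOOLS.get(step, set()):
        return f"You cannot call {tool}() on the {step} step; follow the strategy."
    return None
```

&lt;p&gt;A violating call returns a corrective message instead of executing, which mirrors the guardrail messages described later in this post.&lt;/p&gt;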



&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing a debugging sub-agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When developers run into bugs, they investigate the code to figure out what went wrong. For our SWE-bench Verified run, that role was mimicked by &lt;code&gt;debug_script()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;debug_script()&lt;/code&gt; is a sub-agent inside Refact.ai that uses &lt;code&gt;pdb&lt;/code&gt; to debug, modify, and generate scripts. It helps AI Agent gather key issue details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which files are affected&lt;/li&gt;
&lt;li&gt;What actually caused the failure&lt;/li&gt;
&lt;li&gt;And how it might be fixed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under the hood, &lt;code&gt;debug_script()&lt;/code&gt; is powered by Claude-3.7, with o4-mini for summarizing debug info. We forced Refact.ai Agent to call this tool at least once — and up to three times — during each task.&lt;/p&gt;

&lt;p&gt;In practice, this debugging sub-agent was really helpful for digging into the problem source.&lt;/p&gt;
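&lt;p&gt;Programmatically driving &lt;code&gt;pdb&lt;/code&gt; over a reproduction script might look like the following. A hedged sketch: the &lt;code&gt;run_under_pdb()&lt;/code&gt; helper and its command list are our illustration, not the actual &lt;code&gt;debug_script()&lt;/code&gt; implementation:&lt;/p&gt;

```python
# Hypothetical sketch of how a debug sub-agent might drive pdb over a
# reproduction script; helper name and commands are illustrative only.
import subprocess
import sys

def run_under_pdb(script_path, commands, timeout=120):
    """Run a script under pdb, feeding it debugger commands; return the transcript."""
    # Example commands: ["b mymodule:42", "c", "p some_variable", "q"]
    proc = subprocess.run(
        [sys.executable, "-m", "pdb", script_path],
        input="\n".join(commands) + "\n",
        capture_output=True, text=True, timeout=timeout,
    )
    # The raw transcript is what a summarizing model (o4-mini in our setup)
    # would condense into a debug report.
    return proc.stdout
```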

&lt;h2&gt;
  
  
  &lt;strong&gt;Guardrails to keep AI Agent on track&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The more we tried to constrain the model, the more it resisted. Since the goal was to solve each task in one go, we needed ways to make Agent more reliable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We added &lt;strong&gt;automatic guardrails&lt;/strong&gt; that kick in when the model gets stuck or makes mistakes. Essentially, these are helper messages inserted into the chat mid-run, as if from a simulated “user”, to nudge the Agent back on track. It’s fully automated: a script runs static checks on the main model’s (Claude-3.7) messages, and if it detects signs that something is going off track, it sends a corrective message into the chat. These small actions make a big difference in stability.&lt;/p&gt;
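&lt;p&gt;Such a static check might look like this. The message-dict shape and trigger rules below are assumptions for illustration, not the actual checks:&lt;/p&gt;

```python
# Illustrative sketch of an automatic guardrail: a static check over the chat
# history that returns a corrective "user" message to inject, or None.
# The message-dict shape and trigger rules are assumptions for illustration.
def guardrail_check(messages):
    tool_calls = [m["tool"] for m in messages if m.get("role") == "tool_call"]
    if tool_calls.count("debug_script") > 3:
        return "Do not call debug_script() more than three times."
    if "update_textdoc" in tool_calls and "strategic_planning" not in tool_calls:
        return "Call strategic_planning() before modifying the project."
    return None  # run looks on track; inject nothing
```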

&lt;p&gt;&lt;strong&gt;Extra prompts after sub-agent calls:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After debug_script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💿 Open all visited files using `cat(file1,file2,file3,…)`!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After strategic_planning():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💿 Now implement the solution above.

Reminders:

- Do not create documents, README.md, or other files which are non-related to fixing the problem.
- Convert generated changes into the `update_textdoc()` or `create_textdoc()` tool calls. Do not create
patches (in diff format) or monkey-patches!
- Change the project directly to fix the issue but do not modify existing tests.
- Find and run all project’s existing tests to ensure the fix won’t introduce new problems elsewhere.
- Create new test files only using `create_textdoc()`.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guardrails for Agent flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💿 Use `debug_script()` instead of `shell()`; dig deeper than previous attempts and set breakpoints inside the project.
💿 Do not call `debug_script()` more than three times.
💿 Call `strategic_planning()` before modifying the project.
💿 If you struggle to find the correct solution, consider using `debug_script()` or `strategic_planning()`.
💿 You cannot call {a_tool_name} while on the previous step—follow the strategy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Strategic planning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;strategic_planning()&lt;/code&gt; tool comes in at Step 3 of the Agent prompt. It helps the model improve solution quality by reflecting on what went wrong — and what could be done better — based on the &lt;code&gt;debug_script()&lt;/code&gt; report. It uses reasoning, powered by o3, and updates project files directly, without generating patches and diffs.&lt;/p&gt;

&lt;p&gt;For this tool, we enforce one call per task.&lt;/p&gt;

&lt;p&gt;Since the observation layer (&lt;code&gt;search&lt;/code&gt; + &lt;code&gt;pdb&lt;/code&gt; debug) was already quite efficient, the planning step sometimes lagged behind it. We tried both o4-mini and o3 and found no obvious difference on a small subset of tasks. That said, both models tended to overcomplicate tasks or failed to identify the real root cause. Claude 3.7 might be a good candidate as a planning model in the future, given how well it did in other parts of the workflow.&lt;/p&gt;
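&lt;p&gt;Enforcing one call per task can be sketched with a small wrapper; the &lt;code&gt;OneShotTool&lt;/code&gt; class and tool stub below are illustrative, not the actual mechanism:&lt;/p&gt;

```python
# Minimal sketch of enforcing a single strategic_planning() call per task;
# the wrapper class and the lambda tool stub are illustrative only.
class OneShotTool:
    def __init__(self, fn, name):
        self.fn, self.name, self.calls = fn, name, 0

    def __call__(self, *args, **kwargs):
        if self.calls:
            # Mirrors the guardrail style: a helper message instead of a result.
            return f"{self.name}() may only be called once per task."
        self.calls += 1
        return self.fn(*args, **kwargs)

strategic_planning = OneShotTool(lambda problem: f"plan for: {problem}",
                                 "strategic_planning")
```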

&lt;h2&gt;
  
  
  &lt;strong&gt;Improvements over the SWE-bench Lite strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A 59.7% score on the SWE-bench Lite was a solid start. We shared &lt;a href="https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/" rel="noopener noreferrer"&gt;the full technical breakdown in our earlier blog post&lt;/a&gt; — but even with a SOTA result, this run exposed a few weak spots.&lt;/p&gt;

&lt;p&gt;Before tackling SWE-bench Verified, we prioritized addressing the issues we had found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools-related updates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixed a few tool-related issues, making the tools more tolerant of the model’s uncertainty when calling them.&lt;/li&gt;
&lt;li&gt;Renamed tools — the model often skipped some tools as their names were unclear. New names:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;definition()     -&amp;gt; search\_symbol\_definition()  
references()     -&amp;gt; search\_symbol\_usages()  
regex\_search()  -&amp;gt; search\_pattern()  
search()         -&amp;gt; search\_semantic()  
deep\_analysis() -&amp;gt; strategic\_planning()  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
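&lt;p&gt;A rename like this can be kept backward-compatible with a simple alias map. An illustrative sketch, not the actual refact-lsp mechanism:&lt;/p&gt;

```python
# Backward-compatible tool renaming via an alias map (illustrative sketch).
TOOL_ALIASES = {
    "definition": "search_symbol_definition",
    "references": "search_symbol_usages",
    "regex_search": "search_pattern",
    "search": "search_semantic",
    "deep_analysis": "strategic_planning",
}

def resolve_tool_name(name):
    """Map a legacy tool name to its clearer replacement, if one exists."""
    return TOOL_ALIASES.get(name, name)
```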



&lt;ul&gt;
&lt;li&gt;Fixed the AST mechanisms inside refact-lsp that prevented decorated symbols from being parsed.&lt;/li&gt;
&lt;li&gt;Resolved an issue where Agent didn’t wait for ast/vecdb to finish indexing the project.&lt;/li&gt;
&lt;li&gt;We now mark line numbers in tool output, which adds extra stability to retrieval tools like cat, search, etc.&lt;/li&gt;
&lt;/ul&gt;
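&lt;p&gt;Line marking can be as simple as prefixing each line of tool output; the exact format Refact.ai uses is not shown here, so the layout below is an assumption:&lt;/p&gt;

```python
# Sketch of marking line numbers in retrieval output (e.g. cat); the
# "N | line" layout is an assumption for illustration.
def number_lines(text):
    lines = text.splitlines()
    width = len(str(len(lines)))  # pad so columns stay aligned
    return "\n".join(f"{i:{width}d} | {line}" for i, line in enumerate(lines, 1))
```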

&lt;p&gt;&lt;strong&gt;Context-related updates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced the strength of chat compression. Claude 3.7 often tried to cat files already in context; instead of blocking it (which caused loops), we now allow the model to receive them again.&lt;/li&gt;
&lt;li&gt;Encouraged the model to open whole files instead of reading a file line by line with many tiny cat calls.&lt;/li&gt;
&lt;li&gt;When the model opens large files, it noticeably degrades as the context grows; we continue to adjust this balance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these improvements were in place during the SWE-bench Verified run.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What we tried that did not work&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every experiment makes it to production. Here’s what we tested — and what worked instead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Didn’t work&lt;/th&gt;
&lt;th&gt;What works instead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A separate &lt;code&gt;critique&lt;/code&gt; tool that allowed the model to assess its own changes.&lt;/td&gt;
&lt;td&gt;Turns out, the model does better when it just runs tests and decides the next steps based on results.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A complex &lt;code&gt;strategic_planning()&lt;/code&gt; tool flow with four steps: root-cause analysis → initial solution → critique → refined solution. It overcomplicated simple tasks and lowered success rates.&lt;/td&gt;
&lt;td&gt;Now, &lt;code&gt;strategic_planning()&lt;/code&gt; only generates a solution — and this works better.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Using a &lt;code&gt;pdb()&lt;/code&gt; tool without a dedicated sub-agent. The Claude model preferred &lt;code&gt;shell()&lt;/code&gt; over &lt;code&gt;pdb()&lt;/code&gt;, so debugging rarely happened.&lt;/td&gt;
&lt;td&gt;Introducing the &lt;code&gt;debug_script()&lt;/code&gt; sub-agent made it reliable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Running without sub-agents. As context grew, Claude 3.7 quickly became less accurate and stopped following instructions.&lt;/td&gt;
&lt;td&gt;Letting sub-agents do their job.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From benchmark to real product&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What makes Refact.ai stand out isn’t just the % of solved benchmark tasks — it’s how our AI Agent gets there. &lt;strong&gt;Our goal isn’t to win all leaderboards just for the sake of it, but to build an approach that actually works for real-world programming.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s why SWE-bench Verified is also a way to test and improve the actual engineering flow of our product. Many of the updates we made for the run (see: &lt;em&gt;Tools-related updates, Context-related updates&lt;/em&gt;) are already shipping in Refact.ai.&lt;/p&gt;

&lt;p&gt;The guard mechanisms are another example: in the product, the AI Agent already sends itself helper messages automatically after calling certain tools. With &lt;code&gt;debug_script()&lt;/code&gt;, for instance, it gets the tool output along with a static instruction to open all the related files mentioned. So these guard mechanisms are already part of specific flows, and we’re planning more, including chat-wide checks to spot off-track behavior earlier and react to it.&lt;/p&gt;

&lt;p&gt;We’re also updating the AI Agent prompt used in Refact.ai for VS Code and JetBrains to improve product efficiency for our users. Notably, &lt;code&gt;strategic_planning()&lt;/code&gt; isn’t (and won’t be) called by default in the plugin — it’s heavy on coins spent and not always necessary, since the main model is often enough to solve the task. That said, if you think your task needs deeper reasoning, you can still call it manually in chat with @. Just keep in mind it’s coin-expensive.&lt;/p&gt;

&lt;p&gt;Refact.ai Agent solved SWE-bench Verified fully autonomously — but in real-world use, of course, developers often want more control. That’s why Refact.ai offers flexible interaction with manual overrides: you can delegate tasks to the AI Agent while still previewing and guiding the process.&lt;/p&gt;

&lt;p&gt;That reflects our philosophy: autonomous AI Agent for programming you can trust — and control when you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final score&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Out of 500 tasks in SWE-bench Verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;🥇 Solved: 349 (69.8% resolve rate)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Not solved: 151 (30.2%).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation results&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total Instances&lt;/th&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Not solved&lt;/th&gt;
&lt;th&gt;Solved (%)&lt;/th&gt;
&lt;th&gt;Not solved (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;349&lt;/td&gt;
&lt;td&gt;151&lt;/td&gt;
&lt;td&gt;69.8%&lt;/td&gt;
&lt;td&gt;30.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Resolved by Repository:

astropy/astropy: 9/22 (40.91%)
django/django: 165/231 (71.43%)
matplotlib/matplotlib: 20/34 (58.82%)
mwaskom/seaborn: 0/2 (0.0%)
pallets/flask: 1/1 (100.0%)
psf/requests: 6/8 (75.0%)
pydata/xarray: 18/22 (81.82%)
pylint-dev/pylint: 4/10 (40.0%)
pytest-dev/pytest: 16/19 (84.21%)
scikit-learn/scikit-learn: 28/32 (87.5%)
sphinx-doc/sphinx: 28/44 (63.64%)
sympy/sympy: 54/75 (72.0%)

Resolved by Time:

2013: 3/3 (100.0%)
2014: 2/2 (100.0%)
2015: 0/1 (0.0%)
2016: 2/2 (100.0%)
2017: 13/16 (81.25%)
2018: 14/24 (58.33%)
2019: 73/98 (74.49%)
2020: 79/108 (73.15%)
2021: 54/86 (62.79%)
2022: 71/102 (69.61%)
2023: 38/58 (65.52%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Get Refact.ai Agent for your IDE&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Refact.ai is an autonomous AI Agent that automates programming tasks — helping developers and IT teams move faster.&lt;/p&gt;

&lt;p&gt;With Refact.ai in your IDE, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real automation that boosts productivity by 10x&lt;/li&gt;
&lt;li&gt;Seamless integration with your codebase, workflow, and dev tools&lt;/li&gt;
&lt;li&gt;A digital twin that handles your busywork and lets you focus on big things.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Available to everyone: install Refact.ai for &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; or &lt;a href="https://plugins.jetbrains.com/plugin/20647-refact--open-source-ai-agent-code-generator--chat" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; today and feel the &lt;strong&gt;real impact in your everyday programming.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>#1 on SWE-bench lite, achieved fully autonomously by open-source Refact.ai Agent</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Mon, 05 May 2025 21:45:32 +0000</pubDate>
      <link>https://dev.to/refact/1-on-swe-bench-lite-achieved-fully-autonomously-by-open-source-refactai-agent-4mcj</link>
      <guid>https://dev.to/refact/1-on-swe-bench-lite-achieved-fully-autonomously-by-open-source-refactai-agent-4mcj</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/?utm_source=media&amp;amp;utm_medium=devto&amp;amp;utm_campaign=SWE-light" rel="noopener noreferrer"&gt;Refact.ai Agent has achieved the #1 score on SWE-bench Lite&lt;/a&gt;&lt;/strong&gt; — solving 179 out of 300 tasks, for a 59,7% success rate. Our approach: &lt;strong&gt;&lt;a href="https://refact.ai/?utm_source=media&amp;amp;utm_medium=devto&amp;amp;utm_campaign=SWE-light" rel="noopener noreferrer"&gt;fully autonomous AI Agent for programming&lt;/a&gt;, no manual intervention needed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SWE-bench Lite is a benchmark that evaluates LLM-based systems on real GitHub issues from popular open-source Python projects. Each task requires applying a bug fix or feature implementation, then validating the result through test execution. This makes the benchmark particularly valuable for understanding how AI tools will perform in actual production environments.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent setup
&lt;/h2&gt;

&lt;p&gt;Refact.ai Agent takes a fully autonomous, iterative approach. It plans, executes, tests, and self-corrects — repeating steps as needed &lt;strong&gt;to reach a single correct solution with no user input&lt;/strong&gt;. So, the benchmark setup was designed to reflect our autonomy-first philosophy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt strategy: Defines the Agent’s behavior and high-level task-solving logic. &lt;a href="https://github.com/smallcloudai/refact/blob/swe-boosted-prompt/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;Open-source and available on GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Model: Claude 3.7 Sonnet — responsible for orchestration and decision-making.&lt;/li&gt;
&lt;li&gt;Execution layer: refact-lsp, a backend that connects the model to tools and the environment.&lt;/li&gt;
&lt;li&gt;deep_analysis() tool: Enhanced reasoning, powered by o4-mini.&lt;/li&gt;
&lt;li&gt;Tool suite: Tools for repository exploration, code modification, and testing. Used dynamically based on task needs.&lt;/li&gt;
&lt;li&gt;Step cap: 60 agent steps (each one a discrete action) per task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What sets Refact.ai apart is that our AI Agent independently drives the entire process.&lt;/strong&gt; While some solutions take a semi-autonomous approach — requiring users to manually invoke tools and guide the agent — Refact.ai Agent operates independently from start to finish.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt strategy
&lt;/h2&gt;

&lt;p&gt;Refact.ai’s &lt;a href="https://github.com/smallcloudai/refact/blob/swe-boosted-prompt/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;SWE-bench Lite prompt&lt;/a&gt; follows a clear workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Describe the problem.&lt;/li&gt;
&lt;li&gt;Investigate the repo.&lt;/li&gt;
&lt;li&gt;Create and run problem reproduction script.&lt;/li&gt;
&lt;li&gt;Make a plan (using deep_analysis() powered by o4-mini) and apply changes.&lt;/li&gt;
&lt;li&gt;Run tests and evaluate changes (including optional reasoning with deep_analysis()).&lt;/li&gt;
&lt;li&gt;Repeat steps 4 and 5 until the problem is solved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This workflow serves as high-level guidance, not hard rules.&lt;/strong&gt; &lt;a href="https://refact.ai/?utm_source=media&amp;amp;utm_medium=devto&amp;amp;utm_campaign=SWE-light" rel="noopener noreferrer"&gt;Refact.ai Agent&lt;/a&gt; uses it to form its own strategy — repeating, skipping, or reordering steps based on task context.&lt;br&gt;
For each SWE-bench problem, Refact.ai Agent made &lt;strong&gt;one multi-step run to produce a single, correct final solution.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude 3.7: Model choice
&lt;/h2&gt;

&lt;p&gt;Refact.ai uses Claude 3.7 Sonnet &lt;strong&gt;with sampling temperature 0.0&lt;/strong&gt; as its core model for SWE-bench Lite. It demonstrated exceptional capabilities for autonomous workflows: following multi-step instructions, understanding complex codebases, and maintaining context across long interactions.&lt;/p&gt;

&lt;p&gt;We’ve previously paired &lt;a href="https://refact.ai/blog/2025/refact-ai-agent-achieves-93-3-on-aider-polyglot-benchmark/?utm_source=media&amp;amp;utm_medium=devto&amp;amp;utm_campaign=SWE-light" rel="noopener noreferrer"&gt;Refact.ai with Claude 3.7, solving the Polyglot benchmark&lt;/a&gt;, where it reached &lt;strong&gt;93.3% with Thinking Mode and 92.9% without&lt;/strong&gt; — the highest known scores to date on that task set.&lt;/p&gt;

&lt;h2&gt;
  
  
  deep_analysis() tool
&lt;/h2&gt;

&lt;p&gt;One of the key features in Refact.ai’s approach is the deep_analysis() tool. It adds a structured, three-step reasoning process that improves solution quality at critical moments in the task flow.&lt;/p&gt;

&lt;p&gt;deep_analysis() is powered &lt;strong&gt;by o4-mini&lt;/strong&gt; — a small, fast reasoning model that handles the cognitive load of problem-solving so Claude 3.7 can focus on orchestration.&lt;/p&gt;

&lt;p&gt;The prompt for deep_analysis() tool follows the pattern &lt;a href="https://github.com/smallcloudai/refact/blob/swe-boosted-prompt/refact-agent/engine/src/tools/tool_deep_analysis.rs#L150" rel="noopener noreferrer"&gt;[also on GitHub]&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solution generation
&lt;em&gt;”Get the initial solution.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Critique
&lt;em&gt;”Please critique the solution above. Identify any weaknesses, limitations, or bugs. Be specific and thorough in your analysis. Remember, that the final solution must be minimal, robust, and effective.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Refinement
&lt;em&gt;”Please improve the original solution based on the critique. Provide a comprehensive, refined solution that addresses the weaknesses identified in the critique while maintaining the strengths of the original solution.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
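&lt;p&gt;The three-step pattern can be sketched as sequential calls to a reasoning model. Here &lt;code&gt;ask_model&lt;/code&gt; is a hypothetical stand-in for the o4-mini call, and the prompts are abbreviated from the ones quoted above:&lt;/p&gt;

```python
# Sketch of the generate -> critique -> refine loop; ask_model is a
# hypothetical stand-in for the reasoning-model call, prompts abbreviated.
CRITIQUE = ("Please critique the solution above. Identify any weaknesses, "
            "limitations, or bugs.")
REFINE = "Please improve the original solution based on the critique."

def deep_analysis(problem, ask_model):
    solution = ask_model(problem)                              # 1. generation
    critique = ask_model(f"{solution}\n\n{CRITIQUE}")          # 2. critique
    return ask_model(f"{solution}\n\n{critique}\n\n{REFINE}")  # 3. refinement
```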

&lt;p&gt;This structured loop is normally triggered during &lt;a href="https://github.com/smallcloudai/refact/blob/swe-boosted-prompt/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;Step 4 of the Benchmark prompt&lt;/a&gt; — when planning and applying code changes. But in fact, Refact.ai Agent decides when to use this tool on its own.&lt;/p&gt;

&lt;p&gt;Completing the benchmark, we observed that Agent sometimes called deep_analysis() multiple times — first during planning, then again when testing and evaluating results. In other cases, it skipped the tool entirely. This proves that Refact.ai Agent doesn’t follow a rigid script, but instead &lt;strong&gt;prioritizes its own strategy to get the task done right.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools, tools, tools
&lt;/h2&gt;

&lt;p&gt;Refact.ai Agent has access to a variety of tools that allow it to interact with the entire development environment for task solving.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code exploration: search(), regex_search(), definition(), references(), tree(), cat()&lt;/li&gt;
&lt;li&gt;Editing: create_textdoc(), update_textdoc()&lt;/li&gt;
&lt;li&gt;Shell execution: shell() — used to run Python tests and verify solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools enable AI Agent to navigate codebases, understand dependencies, make precise changes, and verify that its solutions work correctly. It uses them autonomously, choosing what to run and when.&lt;/p&gt;

&lt;p&gt;Although Refact.ai Agent can also interface with real-world tools (GitHub, Docker, PostgreSQL, etc.) and 1000+ tools via MCP servers, these integrations weren’t used in the benchmark run — but are part of standard workflows in user environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  60 steps cap
&lt;/h2&gt;

&lt;p&gt;Claude 3.7 Sonnet has &lt;strong&gt;60 steps&lt;/strong&gt; to complete a task. A step = AI action, such as modifying a file, listing directories, or running tests. AI Agent strategically decides how to use these steps, leading to clear, controlled solutions.&lt;/p&gt;
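&lt;p&gt;The step budget can be sketched as a bounded loop; the function names below are illustrative, not Refact.ai’s code:&lt;/p&gt;

```python
# Sketch of the 60-step budget: each discrete agent action consumes one step,
# and the run stops when the cap is hit; names are illustrative.
MAX_STEPS = 60

def run_agent(task, next_action, is_solved, max_steps=MAX_STEPS):
    for step in range(1, max_steps + 1):
        action = next_action(task, step)  # e.g. edit a file, run tests
        if is_solved(task, action):
            return "solved", step
    return "unsolved", max_steps
```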

&lt;h2&gt;
  
  
  Final SOTA score
&lt;/h2&gt;

&lt;p&gt;Out of 300 tasks in SWE-bench Lite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;🥇 Solved: 179 (59.7% resolve rate)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Not solved: 121 (40.3%).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refact.ai Agent even managed to &lt;strong&gt;solve two SWE-bench tasks that no other listed agents have&lt;/strong&gt; (django-12589, sympy-21627) — supposedly, thanks to the o4-mini model’s reasoning capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Not solved&lt;/th&gt;
&lt;th&gt;Solved (%)&lt;/th&gt;
&lt;th&gt;Unresolved (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;179&lt;/td&gt;
&lt;td&gt;121&lt;/td&gt;
&lt;td&gt;59.7%&lt;/td&gt;
&lt;td&gt;40.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Resolved by repository
&lt;/h2&gt;

&lt;p&gt;astropy/astropy: 3/6 (50.0%) &lt;br&gt;
django/django: 78/114 (68.4%) &lt;br&gt;
matplotlib/matplotlib: 11/23 (47.8%) &lt;br&gt;
mwaskom/seaborn: 2/4 (50.0%) &lt;br&gt;
pallets/flask: 0/3 (0.0%) &lt;br&gt;
psf/requests: 5/6 (83.3%) &lt;br&gt;
pydata/xarray: 2/5 (40.0%) &lt;br&gt;
pylint-dev/pylint: 3/6 (50.0%) &lt;br&gt;
pytest-dev/pytest: 10/17 (58.8%) &lt;br&gt;
scikit-learn/scikit-learn: 17/23 (73.9%) &lt;br&gt;
sphinx-doc/sphinx: 6/16 (37.5%) &lt;br&gt;
sympy/sympy: 43/77 (55.8%)&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking forward
&lt;/h2&gt;

&lt;p&gt;Refact.ai’s performance on SWE-bench Lite demonstrates that AI agents are becoming increasingly capable of handling real-world software engineering tasks autonomously — not just generating code, but planning, debugging, testing, and refining it with minimal human input.&lt;/p&gt;

&lt;p&gt;Our next step is evaluating Refact.ai Agent on &lt;strong&gt;SWE-bench Verified&lt;/strong&gt; — a benchmark with more rigorous testing.&lt;/p&gt;

&lt;p&gt;All of this is part of our open-source commitment. Developers can explore the system, understand how autonomy is implemented, and even contribute. We believe that as the baseline work of software development shifts to AI, human engineers will be free to focus on the more interesting and creative parts of the job — and invite developers to build the future of programming together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does SWE-bench matter for developers?
&lt;/h2&gt;

&lt;p&gt;This isn’t just about ranking highly on a benchmark — it’s about real-world coding impact. Refact.ai Agent helps developers and software companies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate repetitive tasks across the SDLC&lt;/li&gt;
&lt;li&gt;Focus on core work while the AI handles the rest&lt;/li&gt;
&lt;li&gt;Deliver faster with the AI Agent working alongside you in your IDE&lt;/li&gt;
&lt;li&gt;Delegate with confidence, knowing the AI writes reliable, tested code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Refact.ai Agent for your IDE
&lt;/h2&gt;

&lt;p&gt;Vibe coding is the future of software development — get it today.&lt;br&gt;
Refact.ai’s autonomous AI Agent works like a senior developer in your IDE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works inside your workflow &amp;amp; with your dev tools&lt;/li&gt;
&lt;li&gt;Boosts productivity x10 with real automation&lt;/li&gt;
&lt;li&gt;Handles coding while you focus on core work&lt;/li&gt;
&lt;li&gt;Available to everyone in IDE.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try &lt;a href="https://refact.ai/?utm_source=media&amp;amp;utm_medium=devto&amp;amp;utm_campaign=SWE-light" rel="noopener noreferrer"&gt;open-source Refact.ai Agent for programming&lt;/a&gt; in &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; or &lt;a href="https://plugins.jetbrains.com/plugin/20647-refact--open-source-ai-agent-code-generator--chat" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Top-10 Tips for Conscious Vibe Coding</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Tue, 01 Apr 2025 18:05:50 +0000</pubDate>
      <link>https://dev.to/refact/top-10-tips-for-conscious-vibe-coding-17fl</link>
      <guid>https://dev.to/refact/top-10-tips-for-conscious-vibe-coding-17fl</guid>
      <description>&lt;p&gt;Welcome to the world of “Vibe Coding” - a way that’s transforming how we build software. This approach shifts the traditional focus from algorithmic precision to intent-driven development, where developers and non-technical professionals describe what they want rather than how to achieve it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6o3e5ltjrj847n4hcuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6o3e5ltjrj847n4hcuu.png" alt="Refact.ai Agent for Vibe Coding" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can read this text on our blog: &lt;a href="https://refact.ai/blog/2025/top-10-tips-for-conscious-vibe-coding/" rel="noopener noreferrer"&gt;Top-10 Tips for Conscious Vibe Coding&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Vibe Coding?
&lt;/h2&gt;

&lt;p&gt;The term “Vibe Coding” entered the tech lexicon just months ago, courtesy of Andrej Karpathy, former Director of AI at Tesla and ex-OpenAI researcher. On February 3, 2025, Karpathy &lt;a href="https://x.com/karpathy/status/1886192184808149383" rel="noopener noreferrer"&gt;tweeted&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“There’s a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This simple yet profound statement captured the essence of a developing trend: allowing AI to handle the technical complexities while humans focus on ideation and direction. Since then, this term has exploded in popularity, with thousands of developers and entrepreneurs embracing this new paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top-10 Essential Tips for Successful Vibe Coding
&lt;/h2&gt;

&lt;p&gt;If you’re intrigued by the concept and eager to dive in, here are 10 practical tips to enhance your Vibe Coding experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Give clear and concise instructions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your AI isn’t a mind reader. While it’s tempting to provide vague directions and expect magic, specificity is your friend. Instead of saying “build me a habit tracker,” try “create a NextJS-based habit tracker with daily streak tracking and a clean, colorful UI.” The more precise your instructions, the better the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Leverage different AI models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funocet2ubk9utwder0ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funocet2ubk9utwder0ji.png" alt="Leverage different AI models in Refact.ai Agent" width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use advanced reasoning models like OpenAI’s o3-mini specifically for planning (in &lt;a href="https://refact.ai" rel="noopener noreferrer"&gt;Refact.ai&lt;/a&gt;, just click the ‘Think’ button) and for creating detailed Product Requirements Documents (PRDs), then pass these comprehensive plans to execution-focused models like Claude 3.7 to handle the actual coding implementation. Planning models excel at high-level thinking, while execution models efficiently translate those plans into functional code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Accept and iterate, don’t perfect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The beauty of Vibe Coding lies in rapid iteration. Don’t get stuck trying to make everything perfect in one go. Accept what the AI generates, test it, and then refine your instructions based on what you see. This cycle of acceptance and iteration leads to faster development and better results.&lt;/p&gt;

&lt;p&gt;Refact.ai enhances this workflow with powerful tools that allow the Agent to close the feedback loop autonomously. The Chrome tool enables the Agent to open websites or localhost servers, click through interfaces, and capture screenshots to use as additional context. Similarly, the pdb tool allows the Agent to debug code execution in real time, identifying and resolving issues without constant human intervention. These capabilities mean the Agent can not only generate code but also test it, observe its behavior in action, and incorporate visual feedback into subsequent iterations, dramatically accelerating the development cycle while maintaining quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Structure your code in separate files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instruct a model (we recommend Claude 3.7) to generate code in a modular way, using separate files instead of one large script. Keeping everything in a single file can clutter the context window and slow down responses. Instead, request well-organized, smaller files for better clarity. Always remind the Agent to clean up the generated code and remove unnecessary files at the end of every iteration. Asking the Agent to update the README file after every change can also be beneficial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Context is key&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mastering context management is the secret weapon of vibe coders. &lt;a href="https://simonw.substack.com/p/how-i-use-llms-to-help-me-write-code" rel="noopener noreferrer"&gt;As Simon Willison&lt;/a&gt;, Django’s co-creator, points out:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Most of the craft of getting good results out of an LLM comes down to managing its context - the text that is part of your current conversation.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your AI assistant only knows what you’ve shared with it, so deliberately controlling what information is in context for each request improves results. Refact.ai takes this to the next level with its RAG-based system, which allows the Agent to select the most relevant files and open them using commands like “tree” to view the directory structure and “cat” to read file contents. This autonomous context management gives the AI a comprehensive understanding of your codebase without requiring you to manually guide it through every file. The Agent determines which code components are relevant to your current task and brings them into context, ensuring the generated code is tailored to your project’s architecture and conventions.&lt;/p&gt;

&lt;p&gt;The best way to work with &lt;a href="https://refact.ai" rel="noopener noreferrer"&gt;Refact&lt;/a&gt; is to simply indicate 1-2 initial files using the &lt;code&gt;@file&lt;/code&gt; command, and then let the system gather context from other files by itself. This workflow combines the precision of targeted context with the power of autonomous exploration, striking the right balance for productive vibe coding.&lt;/p&gt;
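&lt;p&gt;To make the idea concrete, here is a minimal, hypothetical sketch of what tree- and cat-style context tools boil down to: a directory listing and a file read. The function names mirror the tool names; the real Refact.ai implementations are more sophisticated.&lt;/p&gt;

```python
import os

def tree(root: str, max_depth: int = 2) -> str:
    """Return an indented listing of the project structure,
    similar to what an agent's tree() tool might produce."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        if depth >= max_depth:
            dirnames[:] = []  # prune: stop descending past max_depth
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or root}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return "\n".join(lines)

def cat(path: str) -> str:
    """Read one file into the context window, like a cat() tool."""
    with open(path, encoding="utf-8") as f:
        return f.read()
```

&lt;p&gt;The agent calls the equivalents of these two primitives repeatedly, deciding on its own which directories to expand and which files to read.&lt;/p&gt;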

&lt;p&gt;&lt;strong&gt;6. Learn from the Community&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The community is an incredible resource for learning and exchanging insights. Follow key figures, join forums, and participate in discussions to pick up new techniques. For example, join the &lt;a href="https://discord.com/invite/9GaWEK9Btb" rel="noopener noreferrer"&gt;Refact.ai Discord community&lt;/a&gt;, where developers excited about AI-driven programming gather.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Create a comprehensive prompt library&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Develop a collection of proven prompts that work well for specific tasks. Having templates for common operations like “implement authentication,” “create a responsive navbar,” or “set up database models” can significantly speed up your workflow.&lt;/p&gt;
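&lt;p&gt;One lightweight way to maintain such a library is a plain dictionary of templates with placeholders you fill in per project. A minimal Python sketch (the template names and wording are illustrative, not a Refact.ai feature):&lt;/p&gt;

```python
# A tiny prompt library: reusable templates with {placeholders}
# filled in per project.
PROMPTS = {
    "auth": (
        "Implement {scheme} authentication for a {framework} app. "
        "Store secrets in environment variables and add tests."
    ),
    "navbar": (
        "Create a responsive navbar for {framework} with links to {pages}. "
        "Keep the component in its own file."
    ),
}

def render(name: str, **kwargs: str) -> str:
    """Fill a template's placeholders and return the final prompt."""
    return PROMPTS[name].format(**kwargs)
```

&lt;p&gt;For example, &lt;code&gt;render("auth", scheme="OAuth2", framework="Django")&lt;/code&gt; yields a ready-to-paste prompt, keeping your proven wording consistent across sessions.&lt;/p&gt;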

&lt;p&gt;&lt;strong&gt;8. Understand the basics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While Vibe Coding reduces the need for deep technical knowledge, understanding fundamental concepts is still valuable. Knowing the difference between frontend and backend, basic data structures, and simple programming patterns will help you communicate effectively with your AI tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Document the process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a personal library of effective prompts, patterns, and terminology that consistently yield quality results with your AI assistant. Document successful approaches and your reasoning behind key decisions—this documentation becomes invaluable as projects grow in complexity. Maintain a glossary of project-specific terms that your AI responds well to, noting how slight variations in wording can dramatically affect output quality. This systematic approach transforms vibe coding from a haphazard process into a reproducible methodology that improves with each iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Engineering discipline still matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with powerful AI, engineering fundamentals remain essential. Here are several key pieces of advice from Refact.ai Co-founder Oleg Klimov, who emphasizes the importance of maintaining solid software development practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know your stuff as an engineer. AI can accelerate development, but understanding core software engineering principles ensures you produce high-quality code.&lt;/li&gt;
&lt;li&gt;Pay attention to generated code. AI-generated code will form the core of your project, so review, refine, and optimize it regularly.&lt;/li&gt;
&lt;li&gt;Be aware of AI limitations: learn each model’s weaknesses and adjust accordingly.&lt;/li&gt;
&lt;li&gt;Reduce code size. Simplify your codebase by eliminating redundant parts at every opportunity.&lt;/li&gt;
&lt;li&gt;Write tests and commit frequently. Establish working checkpoints so you can revert to stable versions if needed.&lt;/li&gt;
&lt;/ul&gt;
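&lt;p&gt;The last point can be as lightweight as a few plain assertions run after every AI-generated change, before each commit. A minimal sketch (&lt;code&gt;slugify()&lt;/code&gt; is a hypothetical helper standing in for whatever the Agent just wrote):&lt;/p&gt;

```python
import re

def slugify(title: str) -> str:
    """Lowercase a title and join words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify():
    # A checkpoint test: run it after every change, commit only if it passes.
    assert slugify("Vibe Coding 101!") == "vibe-coding-101"
    assert slugify("  Hello, World  ") == "hello-world"

test_slugify()
```

&lt;p&gt;Even a handful of such checkpoints, combined with frequent commits, gives you stable versions to revert to when an iteration goes wrong.&lt;/p&gt;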

&lt;h2&gt;
  
  
  A Useful AI Agent Case Study
&lt;/h2&gt;

&lt;p&gt;Refact.ai user @Ukro built a complete IoT cloud-monitoring Django application with the Refact.ai Agent doing 99.9% of the work. This real-world project included sophisticated features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Permission management for various modules&lt;/li&gt;
&lt;li&gt;Automatic real-time updates via Ajax&lt;/li&gt;
&lt;li&gt;UI translation editing&lt;/li&gt;
&lt;li&gt;Dark/Light mode&lt;/li&gt;
&lt;li&gt;Custom OAuth2 SSO implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazingly, @Ukro only needed to write the models.py file himself; everything else was handled by the AI Agent. While the Agent worked, @Ukro was busy with his main job and family, as he shared. Here are some screenshots of what the AI Agent built for him:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvy1t8g312v4cxnxm5cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvy1t8g312v4cxnxm5cr.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfuyktb6bsrsfh8g82lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfuyktb6bsrsfh8g82lu.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Potential risks of vibe coding
&lt;/h2&gt;

&lt;p&gt;Vibe coding comes with significant security concerns. Developers often integrate code they don’t fully understand, which can introduce vulnerabilities. Two primary security issues stand out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Blind trust&lt;/strong&gt;: Developers implement AI-generated code without understanding its security implications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency bloat&lt;/strong&gt;: AI tools often add unnecessary dependencies that expand the attack surface.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In a recent article, &lt;a href="https://www.indiehackers.com/post/tech/vibe-coding-has-a-security-problem-here-are-two-solutions-vLxyPTrTlZVwDo76oqvr" rel="noopener noreferrer"&gt;Indie Hackers&lt;/a&gt; gave the example of a non-technical founder, &lt;a href="https://x.com/leojr94_" rel="noopener noreferrer"&gt;@leojr94_&lt;/a&gt;, who capitalized on the vibe coding trend by building an app in public and earning quick revenue, until hackers began exploiting numerous security vulnerabilities in his app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ygiqig8jiifmrs1v8e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ygiqig8jiifmrs1v8e.jpg" alt="Potential risks of vibe coding" width="800" height="954"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To mitigate these risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always review generated code for security best practices.&lt;/li&gt;
&lt;li&gt;Understand core algorithms even if you didn’t write every line.&lt;/li&gt;
&lt;li&gt;Scan for potential vulnerabilities before deploying.&lt;/li&gt;
&lt;li&gt;Minimize dependencies when possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here are some more additional recommendations from Indie Hackers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t make your API keys public; instead of hardcoding them, store keys in a secure environment that only the application can access.&lt;/li&gt;
&lt;li&gt;Don’t make an easily bypassable paywall.&lt;/li&gt;
&lt;/ul&gt;
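&lt;p&gt;The first recommendation takes only a few lines of Python: read keys from the environment and fail loudly if they’re missing. The variable name &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; is just an example:&lt;/p&gt;

```python
import os

def get_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fetch an API key from the environment instead of hardcoding it."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it in your deployment environment "
            "instead of committing it to the repository."
        )
    return key
```

&lt;p&gt;Pair this with a &lt;code&gt;.env&lt;/code&gt; file listed in &lt;code&gt;.gitignore&lt;/code&gt; locally, and your platform’s secret store in production, so keys never reach the repository.&lt;/p&gt;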

&lt;p&gt;Additionally, non-technical founders should balance their “ship fast” mentality with basic security measures or partnerships with security-focused developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What users think about Vibe Coding
&lt;/h2&gt;

&lt;p&gt;“Vibe coding” has exploded as a search term, evolving from niche concept to mainstream tech conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c82djps13w6zyrip52g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c82djps13w6zyrip52g.png" alt="Vibe coding is the future of software development" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yet important distinctions are emerging in the discourse: some developers pointedly differentiate between intention-based “vibe coding” and potentially problematic “blind coding,” where developers implement AI solutions without understanding them. The community response remains decidedly mixed: skeptics outnumber enthusiasts in recent discussions, with many experienced developers expressing reservations about reliability and maintenance implications. Join the discussion in our &lt;a href="https://discord.com/invite/9GaWEK9Btb" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; and share your thoughts!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dxbrftegnuevdulz4xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dxbrftegnuevdulz4xs.png" alt="What developers think of vibe coding" width="800" height="1044"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, the market is responding in polarized ways: some organizations are actively recruiting for “Vibe Coder” positions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuivjs3oyhmxha3gfzjb4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuivjs3oyhmxha3gfzjb4.jpg" alt="Vibe coder wanted" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, entrepreneurial developers have launched services specifically targeting the cleanup of problematic code generated through vibe coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewc2jw2yxghy1fgsrxup.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewc2jw2yxghy1fgsrxup.jpg" alt="Connect with expert developers to fix your code" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Difference between AI Tools for Vibe Coding
&lt;/h2&gt;

&lt;p&gt;This quadrant provides an overview of various AI coding tools, highlighting the key differences between those associated with “vibe coding” (such as Lovable, Bolt, and v0) and more developer-centric solutions like Refact.ai.&lt;/p&gt;

&lt;p&gt;It combines two quadrants originally shared by &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7295132016201588736/" rel="noopener noreferrer"&gt;Henry Shi&lt;/a&gt; (co-founder of Super.com) and &lt;a href="https://www.linkedin.com/posts/lucamezzalira_genai-developers-coding-activity-7309338036385792000-F9UI?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAADZs-MUB4aYFSf_rIeq9hYWWakw0z9dHAQU" rel="noopener noreferrer"&gt;Luca Mezzalira&lt;/a&gt; (Principal Serverless Specialist Solutions Architect at Amazon Web Services), who also leads the ‘Dear Architects’ newsletter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4tnx2ggqms9g7vp2fof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4tnx2ggqms9g7vp2fof.png" alt="Vibe coding tools" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools focused on non-technical users:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primarily designed for non-technical users (product managers, designers)&lt;/li&gt;
&lt;li&gt;Focus on natural language understanding&lt;/li&gt;
&lt;li&gt;Prioritize ease of use over technical precision&lt;/li&gt;
&lt;li&gt;Best for simple applications and prototypes (MVPs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Developer-centric solutions (Refact.ai):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built for professional developers&lt;/li&gt;
&lt;li&gt;Balance between natural language and technical precision&lt;/li&gt;
&lt;li&gt;Integrate with existing development workflows&lt;/li&gt;
&lt;li&gt;Support complex codebases and enterprise requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conscious vibe coding approach
&lt;/h2&gt;

&lt;p&gt;Vibe coding represents a paradigm shift in software development — one that promises greater accessibility, faster development cycles, and more focus on creative problem-solving. However, it also demands a new set of skills focused on effective communication, critical evaluation, and security awareness.&lt;/p&gt;

&lt;p&gt;Tools like Refact.ai are leading the way in conscious vibe coding, helping developers maintain quality and security while embracing the productivity benefits of AI collaboration. By developing a balanced approach that combines the intuitive nature of vibe coding with the rigor of traditional development practices, we can unleash the full potential of this new frontier.&lt;/p&gt;

&lt;p&gt;As we continue to explore the possibilities of AI-powered development, the most successful developers will be those who view AI not as a replacement for technical understanding, but as a powerful collaborator that amplifies their existing skills and expands their creative possibilities.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://linktr.ee/refactai" rel="noopener noreferrer"&gt;Get Refact.ai Agent for IDE&lt;/a&gt; | More links: &lt;a href="https://github.com/smallcloudai/refact" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://www.linkedin.com/company/refactai/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/refact_ai" rel="noopener noreferrer"&gt;X&lt;/a&gt;, &lt;a href="https://www.smallcloud.ai/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Refact.ai Agent scores highest on Aider's Polyglot Benchmark: 93.3% with Thinking Mode, 92.9% without</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Tue, 01 Apr 2025 16:34:54 +0000</pubDate>
      <link>https://dev.to/refact/refactai-agent-scores-highest-on-aiders-polyglot-benchmark-933-with-thinking-mode-929-1do2</link>
      <guid>https://dev.to/refact/refactai-agent-scores-highest-on-aiders-polyglot-benchmark-933-with-thinking-mode-929-1do2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Refact.ai Agent, powered by Claude 3.7 Sonnet, has achieved top performance on the Aider Polyglot Benchmark:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;92.9% without Thinking&lt;/li&gt;
&lt;li&gt;93.3% with Thinking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This 20-point lead over the highest previously listed score (72.9% by Aider with Gemini 2.5 Pro) showcases Refact.ai’s superior autonomous coding capabilities. &lt;strong&gt;It handles programming tasks end-to-end in the IDE — planning, execution, testing, refinement — and delivers a highly accurate result with zero human intervention.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0q9lpkm5chkhc8fbq5zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0q9lpkm5chkhc8fbq5zq.png" alt="Refact.ai Agent scores highest on Aider's Polyglot Benchmark: 93.3% with Thinking Mode, 92.9% without" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📖 &lt;a href="https://refact.ai/blog/2025/refact-ai-agent-achieves-93-3-on-aider-polyglot-benchmark/" rel="noopener noreferrer"&gt;Explore the full article on our blog&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Aider’s Polyglot benchmark evaluates how well AI models handle 225 of the &lt;a href="https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark" rel="noopener noreferrer"&gt;hardest coding exercises from Exercism&lt;/a&gt; across C++, Go, Java, JavaScript, Python, and Rust. Unlike SWE-Bench, which focuses solely on Python and single-file edits within 12 repos, Polyglot tests AI ability to write and integrate new code across diverse, multi-language projects, making it much closer to real-world developer workflows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Our approach: How Refact.ai achieved #1 in the polyglot leaderboard
&lt;/h2&gt;

&lt;p&gt;Refact.ai Agent takes &lt;strong&gt;a fully autonomous, iterative approach. It plans, executes, tests, self-corrects, and repeats steps as needed to fully complete tasks with high accuracy — without human input.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other approaches&lt;/strong&gt; may follow a more manual, structured workflow, where some steps are &lt;strong&gt;controlled by human input + pre-defined scripts.&lt;/strong&gt; Aider’s benchmark setup looks similar to this, following the trajectory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt → User provides task description → User manually collects and adds files to the context → Model attempts to solve the task → Then retries, controlled by the --tries parameter → User runs tests manually and, if they fail, provides feedback to the model → Model does corrections → Result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;This workflow requires ongoing user involvement&lt;/strong&gt; — manually providing context, running tests, and guiding the AI. The model itself doesn’t form a strategy, search for files, or decide when to test. Of course, this approach saves tokens, but it also lacks autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refact.ai has a different, autonomy-first AI workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt + tool-specific prompts → User provides task description → AI Agent autonomously solves it within 30 steps (i.e. searches for the relevant data, calls tools, decides when corrections are needed, runs tests, etc.) → Result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;So, Refact.ai interacts with the development environment, verifies its own work, and optimizes resources to fully solve the task end-to-end&lt;/strong&gt;, delivering efficient and practical programming flow with full-scope autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is much closer to real-world software development and vibe coding&lt;/strong&gt;: developers can delegate entire tasks to AI Agent while doing other work, then simply receive the final result. It enables teams to get 2x more done in parallel, get the best out of AI models, and focus on big things instead of basic coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key aspects of Refact.ai Agent approach:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1️⃣ 100% autonomy at each step&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We at Refact.ai focus on making our AI Agent &lt;strong&gt;as autonomous, reliable, and trustworthy as possible.&lt;/strong&gt; To complete tasks, it follows a structured prompt — since Refact.ai is open-source, you can explore &lt;a href="https://github.com/smallcloudai/refact/blob/dev/refact-agent/engine/src/yaml_configs/customization_compiled_in.yaml#L63" rel="noopener noreferrer"&gt;our AI Agent prompt on GitHub&lt;/a&gt;. Below is an excerpt:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;PROMPT_AGENTIC_TOOLS: |&lt;br&gt;
You are Refact Agent, an autonomous bot for coding tasks.&lt;br&gt;
STRATEGY&lt;br&gt;
Step 1: Gather Existing Knowledge&lt;br&gt;
Goal: Get information about the project and previous similar tasks.&lt;br&gt;
Always call the knowledge() tool to get initial information about the project and the task.&lt;br&gt;
This tool gives you access to memories, and external data, example trajectories (🗃️) to help understand and solve the task.&lt;br&gt;
Step 2: Gather Context&lt;br&gt;
Goal: Fully understand the task by gathering all important details.&lt;br&gt;
Do the following:&lt;br&gt;
Use tree() to check the project structure.&lt;br&gt;
Use locate() with the task description to find all necessary files automatically.&lt;br&gt;
Use additional tools like search(), regex_search(), cat(), definition(), and references() and any other to collect relevant information.&lt;br&gt;
Check any files that might indirectly relate to the task.&lt;br&gt;
Running available validation tools preliminary - is a good idea.&lt;br&gt;
Step 3: Make a Clear Plan&lt;br&gt;
Goal: Create a clear, validated plan before making changes.&lt;br&gt;
After gathering context (Step 2), create your plan independently.&lt;br&gt;
Only call deep_think() if:&lt;br&gt;
The task is very complex,&lt;br&gt;
You face complicated issues or loops,&lt;br&gt;
The user specifically asks for it.&lt;br&gt;
When using deep_think(), clearly state the task, what the outcome should look like, and carefully check the plan.&lt;br&gt;
After planning, use create_knowledge() to save important ideas, decisions, and strategies.&lt;br&gt;
Step 4: Make Changes Step-by-Step&lt;br&gt;
Goal: Follow the validated plan carefully.&lt;br&gt;
Make changes step-by-step using *_textdoc() tools.&lt;br&gt;
Clearly comment each step and its results.&lt;br&gt;
After making changes, go to the assessment step.&lt;br&gt;
Step 5: Check and Improve Your Work&lt;br&gt;
Goal: Make sure your changes are correct, complete, and really working.&lt;br&gt;
After changes:&lt;br&gt;
Run available build validation tools (cmdline_cargo_check(), cmdline_pytest*(), cmdline_npm_test(), etc…).&lt;br&gt;
If you find any issues, go back to Step 4 and fix them.&lt;br&gt;
If you cannot solve an issue even after multiple attempts, go back to Step 3 and use deep_think().&lt;br&gt;
Do not pay attention to skipped tests, they don’t indicate any problems&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In short: &lt;strong&gt;The AI analyzes files → creates a task plan → executes it → tests results → refines the solution — all without human intervention.&lt;/strong&gt; It independently decides what to check, fix, and retry. You don’t need to force this manually or copy-paste anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2️⃣ Deep environment interaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model fully integrates with the development environment. It can autonomously read files, call tools, modify code, run tests, and more as needed, whenever needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3️⃣ Capped at 30 steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI Agent has 30 messages to complete a task. Here, a “message” doesn’t mean human input (the AI needs only one of those: the task description); it refers to each AI action, such as modifying a file, listing directories, or running tests.&lt;/p&gt;

&lt;p&gt;This cap ensures efficiency while avoiding unlimited retries or token inflation. AI Agent decides when and how to use these messages, leading to clear, controlled solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4️⃣ Self-testing with a possibility to check previous steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tests are essential for autonomous AI to validate correctness and ensure the final output is reliable.&lt;/p&gt;

&lt;p&gt;Refact.ai Agent can run tests multiple times. &lt;strong&gt;It doesn’t just test the final output&lt;/strong&gt; — it can go back to earlier steps, correct mistakes, and retry. If test failures reveal deeper issues, the model may decide to revise previous actions and then create and test a new solution. Plus, the model sometimes runs tests even before trying to write a solution to gather a better preliminary context.&lt;/p&gt;
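&lt;p&gt;Putting the step cap and self-testing together, the control flow can be sketched roughly as follows. This is a simplified illustration, not the actual Refact.ai implementation; &lt;code&gt;act()&lt;/code&gt; and &lt;code&gt;tests_pass()&lt;/code&gt; are hypothetical stand-ins for the Agent’s real tool calls:&lt;/p&gt;

```python
from typing import Callable

def run_agent(task: str,
              act: Callable[[str, list], str],
              tests_pass: Callable[[], bool],
              max_steps: int = 30) -> bool:
    """Step-capped agent loop: act, test, and retry until the tests
    pass or the message budget runs out."""
    history: list = []
    for _ in range(max_steps):
        # Each action (edit a file, list a directory, run tests, ...)
        # consumes one of the 30 messages.
        history.append(act(task, history))
        if tests_pass():  # self-validate mid-process, not just at the end
            return True
    return False  # budget exhausted without a validated solution
```

&lt;p&gt;The budget keeps token usage bounded while still letting the Agent decide, within the loop, when to edit, when to test, and when to backtrack.&lt;/p&gt;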




&lt;p&gt;Our approach may slightly differ from what Aider used for solving this benchmark, as our AI agent strategy and vision focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full autonomy of AI agent at every step, with no need for human input&lt;/li&gt;
&lt;li&gt;Deep integration with tools and dev environment, enabling Agent to act independently&lt;/li&gt;
&lt;li&gt;Self-testing autonomously when needed, revising earlier steps, testing mid-process, and running multiple checks for accuracy&lt;/li&gt;
&lt;li&gt;Task completion within 30 steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Refact.ai Agent improvements &amp;amp; benchmark progress
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Two weeks ago, we announced that &lt;a href="https://refact.ai/blog/2025/refact-ai-agent-claude-3-7-sonnet-ranked-1-aider-polyglot/" rel="noopener noreferrer"&gt;Refact.ai had achieved 76.4% on Polyglot with Claude 3.7 Sonnet No-Thinking&lt;/a&gt;, already outperforming other listed results.&lt;/strong&gt; However, the benchmark run revealed areas for improvement: in some cases, the AI lacked enough steps to finish tasks or skipped testing. Since then, we have introduced general improvements to the Refact.ai Agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doubled limit for AI to solve the task from 15 → 30&lt;/strong&gt; to allow more complex solutions to fully complete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforced test execution&lt;/strong&gt;, ensuring AI validates itself at least once per task.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These enhancements made Refact.ai Agent more effective for all users&lt;/strong&gt; — also boosting the score from 76.4% to 92.9%, and reaching 93.3% score with Thinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking vs. No-Thinking mode: What’s the difference?
&lt;/h2&gt;

&lt;p&gt;“I think, therefore I am,” as René Descartes said. For AI, thinking isn’t philosophical but practical: it refers to a state where models allocate additional computational resources for deeper reasoning.&lt;/p&gt;

&lt;p&gt;Completing the Polyglot benchmark, Claude 3.7 Sonnet with Thinking received &lt;strong&gt;an extra 4K tokens per response to use for reasoning when requested.&lt;/strong&gt; A quick recap of the results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;92.9% without Thinking&lt;/li&gt;
&lt;li&gt;93.3% with Thinking — but it also consumed nearly twice as many tokens to complete the benchmark.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 For production use, we recommend enabling Thinking when working on longer, multi-step problems requiring deeper analysis, or handling high-risk code changes where additional reasoning helps avoid errors. You can try it with Refact.ai Agent.&lt;/p&gt;




&lt;p&gt;Full breakdown, approach reveal &amp;amp; AI Agent insights we've shared on our blog: &lt;strong&gt;&lt;a href="https://refact.ai/blog/2025/refact-ai-agent-achieves-93-3-on-aider-polyglot-benchmark/" rel="noopener noreferrer"&gt;Refact.ai Agent scores highest on Aider's Polyglot Benchmark: 93.3% with Thinking Mode, 92.9% without.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Happy to discuss!&lt;/p&gt;




&lt;p&gt;&lt;a href="https://linktr.ee/refactai" rel="noopener noreferrer"&gt;Get Refact.ai Agent for IDE&lt;/a&gt; | More links: &lt;a href="https://github.com/smallcloudai/refact" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://www.linkedin.com/company/refactai/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/refact_ai" rel="noopener noreferrer"&gt;X&lt;/a&gt;, &lt;a href="https://discord.com/invite/9GaWEK9Btb" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Our AI Agent + 3.7 Sonnet ranked #1 on Aider’s polyglot bench — a 76.4% score</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Tue, 18 Mar 2025 11:11:01 +0000</pubDate>
      <link>https://dev.to/refact/our-ai-agent-37-sonnet-ranked-1-on-aiders-polyglot-bench-a-764-score-10d0</link>
      <guid>https://dev.to/refact/our-ai-agent-37-sonnet-ranked-1-on-aiders-polyglot-bench-a-764-score-10d0</guid>
      <description>&lt;p&gt;We’ve built &lt;strong&gt;an open-source AI Agent for programming in IDE&lt;/strong&gt;, and it ranked #1 &lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;on Aider’s Polyglot Benchmark&lt;/a&gt; with a 76.4% score. The benchmark tests autonomous problem-solving on 225 of the hardest coding exercises across C++, Go, Java, JavaScript, Python, and Rust.&lt;/p&gt;

&lt;p&gt;With this score, we’ve outperformed Aider’s own 60.4% with the same model, Claude 3.7 Sonnet, as well as DeepSeek Chat V3, GPT-4.5 Preview, and ChatGPT-4o.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbbqy5ki17r2d1z96lv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbbqy5ki17r2d1z96lv6.png" alt="Refact.ai + Claude 3.7 Sonnet ranked #1 on  Aider’s polyglot bench —  a 76.4% score" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How? Our AI Agent uses an iterative problem-solving approach: it writes code, validates it, fixes errors, and repeats until the task is solved correctly. No shortcuts — just reliable, production-ready results.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While SWE Bench gets a lot of attention, we’ve found Polyglot to be a far better measure of AI agents’ problem-solving abilities. It’s not just about passing tests or raw code generation — it’s about reasoning, precision, and delivering working solutions.&lt;/p&gt;

&lt;p&gt;For more details on our tech setup and approach, &lt;a href="https://refact.ai/blog/2025/refact-ai-agent-claude-3-7-sonnet-ranked-1-aider-polyglot/" rel="noopener noreferrer"&gt;check out our blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’d love to hear your thoughts and feedback!&lt;/p&gt;




&lt;h2&gt;
  
  
  About Aider’s polyglot benchmark
&lt;/h2&gt;

&lt;p&gt;The benchmark evaluates how well AI can handle &lt;a href="https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark" rel="noopener noreferrer"&gt;225 of the hardest coding exercises from Exercism&lt;/a&gt; across C++, Go, Java, JavaScript, Python, and Rust. It focuses exclusively on the most challenging problems and measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can AI write new code that integrates seamlessly into existing codebases?&lt;/li&gt;
&lt;li&gt;Can AI successfully apply all its changes to source files without human intervention?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full test set in the &lt;a href="https://github.com/Aider-AI/polyglot-benchmark" rel="noopener noreferrer"&gt;Aider polyglot benchmark repo on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Polyglot &amp;gt; SWE Bench
&lt;/h2&gt;

&lt;p&gt;SWE Bench is popular and often seen as a key benchmark for AI coding agents. However, it has significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only tests Python&lt;/li&gt;
&lt;li&gt;Relies on just 12 repositories (e.g., Django, SymPy)&lt;/li&gt;
&lt;li&gt;Benchmarked models are often pre-trained on these repos (skewing results)&lt;/li&gt;
&lt;li&gt;Only one file is changed per task (unrealistic for typical development work)&lt;/li&gt;
&lt;li&gt;Human-AI interaction is oversimplified (in reality, devs adjust how they collaborate with AI).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these constraints, SWE Bench measures performance in a controlled environment rather than an AI agent’s effectiveness in real software engineering workflows.&lt;/p&gt;

&lt;p&gt;In contrast, Polyglot is much more representative and realistic — it imitates the environments developers work in every day and reflects actual needs. It measures how well AI can autonomously interact with diverse, multi-language projects.&lt;/p&gt;

&lt;p&gt;So, we’d like to thank &lt;a href="https://aider.chat/" rel="noopener noreferrer"&gt;Aider&lt;/a&gt; for introducing this comprehensive benchmark! It provides great insights into AI coding tools and helps drive better solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our approach: How Refact.ai achieved #1 in the leaderboard
&lt;/h2&gt;

&lt;p&gt;Many AI Agents rely on a single-shot approach — get a task, generate code once, and hope for the best. But LLMs aren’t all-knowing — they have limits and make mistakes, so their first attempt often isn’t accurate or reliable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sure, you can pre-train models to ace specific tasks that trend on X… But what’s the point? This doesn’t translate to real-world performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Refact.ai, we do things differently. &lt;strong&gt;Our AI Agent uses a multi-step process:&lt;/strong&gt; instead of settling for the first attempt, it validates its work through testing and iteration until the task is done right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We call it a feedback loop:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes code: The agent generates code based on the task description.&lt;/li&gt;
&lt;li&gt;Validates: Runs automated checks for issues.&lt;/li&gt;
&lt;li&gt;Iterates: If problems are found, the agent corrects the code, fixes bugs, and re-tests until the task is successfully completed.&lt;/li&gt;
&lt;li&gt;Delivers the result, which will be correct most of the time!&lt;/li&gt;
&lt;/ul&gt;
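&lt;p&gt;A minimal sketch of such a loop (the &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;run_checks&lt;/code&gt; callables here are hypothetical stand-ins, not Refact.ai’s actual internals):&lt;/p&gt;

```python
def feedback_loop(task, generate, run_checks, max_attempts=5):
    """Iterate until the generated code passes validation (a toy sketch)."""
    code = generate(task, feedback=None)          # 1. write code
    for _ in range(max_attempts):
        errors = run_checks(code)                 # 2. run automated checks
        if not errors:
            return code                           # 4. deliver the result
        code = generate(task, feedback=errors)    # 3. fix and re-test
    return code                                   # give up after max_attempts

# Toy usage: a "model" that only gets it right once it sees the error.
def toy_generate(task, feedback):
    return "def add(a, b): return a + b" if feedback else "def add(a, b): return a - b"

def toy_checks(code):
    scope = {}
    exec(code, scope)
    return [] if scope["add"](2, 3) == 5 else ["add(2, 3) != 5"]

solution = feedback_loop("implement add", toy_generate, toy_checks)
```

The first attempt fails the check, the error is fed back, and the second attempt passes — the same loop structure scales to real test suites.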

&lt;p&gt;&lt;strong&gt;This drives the real value of AI — actually solving problems, not just scoring well on benchmarks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://refact.ai/blog/2025/refact-ai-agent-claude-3-7-sonnet-ranked-1-aider-polyglot/" rel="noopener noreferrer"&gt;Read more about the tech side in our blog.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of Refact.ai’s autonomous AI Agent
&lt;/h2&gt;

&lt;p&gt;Refact.ai’s advanced AI Agent thinks and acts like a developer, handling software engineering tasks end-to-end.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous task execution&lt;/li&gt;
&lt;li&gt;Deep contextual understanding&lt;/li&gt;
&lt;li&gt;Dev tools integration (GitHub, Docker, PostgreSQL, MCP Servers, and more)&lt;/li&gt;
&lt;li&gt;Memory and continuous improvement&lt;/li&gt;
&lt;li&gt;Human-AI collaboration&lt;/li&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try Refact.ai in your IDE
&lt;/h2&gt;

&lt;p&gt;Vibe coding is the future of software development. Get 10x productivity with Refact.ai Agent in your IDE, handling complex programming tasks for you while you focus on core work. &lt;/p&gt;

&lt;p&gt;✅ Available to everyone in &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://plugins.jetbrains.com/plugin/20647-refact-ai" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'd be happy if you test Refact.ai Agent on your software development tasks and share your opinion!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>chatgpt</category>
      <category>llm</category>
    </item>
    <item>
      <title>Developing AI Agent: SWE bench critics, speed-smart trade-off, and how to make the agent useful</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Thu, 08 Aug 2024 10:52:24 +0000</pubDate>
      <link>https://dev.to/refact/developing-ai-agent-swe-bench-critics-speed-smart-trade-off-and-how-to-make-the-agent-useful-3b0b</link>
      <guid>https://dev.to/refact/developing-ai-agent-swe-bench-critics-speed-smart-trade-off-and-how-to-make-the-agent-useful-3b0b</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR: It's the early days of AI autonomously working as a software engineer. A human is still required to steer these capable systems. &lt;a href="https://refact.ai/blog/2024/programming-with-ai-our-dream/" rel="noopener noreferrer"&gt;We at Refact.ai look at potential UI, speed, cost, and tradeoffs.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Not Solving SWE-Bench
&lt;/h2&gt;

&lt;p&gt;Let’s quickly get the SWE benchmark out of the way. Some people still look at this benchmark as a primary measure of the capability of a coding agent – it isn’t.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench consists of real issues in open-source repositories. But there are some problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All repositories in SWE-bench Lite are in Python&lt;/li&gt;
&lt;li&gt;There are just 12 of them (django, sympy, …)&lt;/li&gt;
&lt;li&gt;All models are already familiar with these open source repositories&lt;/li&gt;
&lt;li&gt;Correct output is always 1 file changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, the tasks that 99% of developers do every day aren't related to the benchmark at all.&lt;/p&gt;

&lt;p&gt;But the problems with the SWE-bench don’t end there. Sure, there’s some value in just using the description given by a submitter on github, but it’s not realistic for a real human-machine interface. Humans can learn, and they will adapt to the AI they use. They will know what kind of tasks an agent can solve, what it cannot, and only give the tasks that are likely to succeed.&lt;/p&gt;

&lt;p&gt;Then the question becomes: &lt;strong&gt;what kind of AI agent is best for a human to achieve high speed and comfort in programming?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And by the way, robots can learn, too, collecting information about the project over time, and that capability is not measured by SWE-bench – all the projects used are open source and already memorized by all the models. Not true for any new code or proprietary code that &lt;strong&gt;requires adaptation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Hey, if you are a student and you are looking for a good task to work on, make a better benchmark! It does not require a massive GPU farm, and will earn you a lot of citations for your paper. But find other people to cooperate on this, don’t do it on your own.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive vs. Long-Thinking AI Agents
&lt;/h2&gt;

&lt;p&gt;Here’s one interesting trade-off: the current AI agents are slow.&lt;/p&gt;

&lt;p&gt;The reason they are painfully slow is that they need to find a good cross-section of the repository at hand, generate and verify one or several solutions. Each of those operations requires at least one call to a state-of-the-art model (usually behind an API) with a lot of tokens. For this reason, agents are also expensive to run.&lt;/p&gt;

&lt;p&gt;But let’s talk about speed. Is it possible to create a fast intelligent agent that can be &lt;strong&gt;used interactively&lt;/strong&gt;? Does it mean it’s necessarily stupid, compared to a slow-thinking counterpart?&lt;/p&gt;

&lt;p&gt;Is it possible to build a lovable product that &lt;strong&gt;embraces the slowness&lt;/strong&gt; instead? This will require a completely different approach for a human: start several tasks, go have some tea, and when you come back some results will be ready. What are the requirements for this kind of AI agent to be usable?&lt;/p&gt;

&lt;p&gt;For Refact.ai, what should be the priority? Let’s see if we can spot some trends.&lt;/p&gt;

&lt;p&gt;Coding assistants started as an interactive feature: code completion. There were no slow-thinking products at all, and there are still none in coding assistants that are ready for prime time.&lt;/p&gt;

&lt;p&gt;Over time, models become cheaper, faster and smarter. The reason why we are talking about a long-thinking agent for developers at all – is the “smarter” part. There’s no use in running a stupid model for a long time, it will just produce a lot of stupid.&lt;/p&gt;

&lt;p&gt;But holding the “smarter” part constant as models become cheaper and faster, what happens to a slow-thinking agent? Does it automatically become fast-thinking, and therefore interactive? Or will the slow AI agent never go away: if a slow agent of today becomes fast tomorrow, we can always add more work and it will become slow again?&lt;/p&gt;

&lt;h2&gt;
  
  
  Refact.ai Approach
&lt;/h2&gt;

&lt;p&gt;Well, we don’t know what people will like more.&lt;/p&gt;

&lt;p&gt;On the technical side, the approach that should stand the test of time is to make building blocks that are useful in either scenario, and the difference between interactive and slow non-interactive regime is just the top-level system prompt.&lt;/p&gt;

&lt;p&gt;An argument might go both ways: interactive agents are great because they are steerable along the way, and we want to keep a human in control. Non-interactive agents are also great: most tasks are not that complicated, just shut up and do the work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’ve figured the best way to decide these things is to speak to actual developers and let them try an early version of the Refact.ai AI agent. There’s a group of about ten enthusiastic people we selected; we’ll learn about their problems &lt;strong&gt;and shape our future product together with these people&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡To get early access to our AI agent and provide your feedback, &lt;a href="https://forms.gle/REUqRqJEZ4Hk3SmN9" rel="noopener noreferrer"&gt;fill out this form&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Refact.ai Future User Interface
&lt;/h2&gt;

&lt;p&gt;Let’s talk about a less-than-realtime UI. What should it look like, if you want to launch a task and forget about it for 5-15 minutes?&lt;/p&gt;

&lt;p&gt;Well, if they are so slow, you probably want to run several of them at the same time. You might also want to specify exactly what you want. If it takes considerable time to write a longer description, by the time the first one finishes or fails, your next task description might already be ready.&lt;/p&gt;

&lt;p&gt;From that, &lt;strong&gt;several sessions need to work at the same time in the UI&lt;/strong&gt;. Unfinished task descriptions or follow ups should reliably stay in place when you switch between sessions.&lt;/p&gt;

&lt;p&gt;Furthermore, Refact.ai is &lt;a href="https://github.com/smallcloudai" rel="noopener noreferrer"&gt;an open source&lt;/a&gt; AI tool, and has nothing to hide. We’ll present all the internal model calls in the UI as well. How exactly a diff is generated by the autonomous agent? You will be able to see that.&lt;/p&gt;

&lt;h2&gt;
  
  
  One With the Machine
&lt;/h2&gt;

&lt;p&gt;There are 4 main factors that make an AI agent useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intelligence of the underlying models&lt;/li&gt;
&lt;li&gt;Quality of the tools available to the model&lt;/li&gt;
&lt;li&gt;Learning curve for a human&lt;/li&gt;
&lt;li&gt;Speed and comfort for a human once adapted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We at Refact work mostly on tools: we have functions for Abstract Syntax Tree access, VecDB search, AI expert calls, and more. In this blog, the question is – what makes it an enjoyable experience?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observable.&lt;/strong&gt; It must be clear what’s going on with a task. That includes any failures: models sometimes go in circles, or go off the rails. This should all be transparent to the user, easy to understand. Based on that information, the user can correct their future tasks, or report the failure to improve the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open and customizable.&lt;/strong&gt; Other companies might hide the internals from you, but Refact is open source. If you see how an internal expert calls work in UI, you can understand the underlying technology, or even customize it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diffs.&lt;/strong&gt; Software engineers use version control systems (like Git or Mercurial) every day. If there are a lot of changes produced by a single agent invocation, it needs to closely match the already familiar concepts: simple diffs for the interactive use case, ready-to-pull commits or small branches for the slow-thinking use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory.&lt;/strong&gt; You should be able to tell how to solve a certain problem, and the model needs to memorize that, so it wouldn’t get stuck in a similar situation later. Or you can just tell your preferences directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptation to the project.&lt;/strong&gt; Over time, the AI agent should build a project-specific knowledge base, especially successful and unsuccessful attempts at doing things to improve the success rate in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enough with the Dreams, What’s Reality?
&lt;/h2&gt;

&lt;p&gt;We want to build Refact.ai AI agent as a product people will love.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxu1d4bq3nowadz0d6bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxu1d4bq3nowadz0d6bt.png" alt="Refact.ai AI agent" width="708" height="999"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have just posted a pre-release version in the &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VSCode Marketplace&lt;/a&gt;. It’s experimental, not yet recommended for anyone other than enthusiasts. But it won’t be experimental for long; on August 19, we plan a stable release for everybody!&lt;/p&gt;

&lt;p&gt;You will be able to turn autonomous agent features off for each chat: in that case, Refact.ai will be an AI coding assistant with RAG tools.&lt;/p&gt;

&lt;p&gt;When the agent feature is turned on, the chat will use a different system prompt and a different set of tools, and will find and patch relevant files on its own. We should be able to go through a couple of cycles of real-world feedback by then, to make the UI experience well polished.&lt;/p&gt;

&lt;p&gt;The next version in September will add Docker container isolation for automatic verification of changes, along with memory. With that, Refact.ai’s autonomous agent will also become available for Enterprise users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.com/invite/j3P4t5gm95" rel="noopener noreferrer"&gt;To know more, you can join Refact.ai Discord&lt;/a&gt;: there we’re sharing first updates on our AI agent and how to get early access to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading and best regards,&lt;br&gt;
Oleg Klimov&lt;br&gt;
CEO Refact.ai&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Implementing RAG in Refact.ai AI Coding Assistant</title>
      <dc:creator>Oleg Klimov</dc:creator>
      <pubDate>Tue, 28 May 2024 12:12:48 +0000</pubDate>
      <link>https://dev.to/refact/implementing-rag-in-refactai-ai-coding-assistant-35fg</link>
      <guid>https://dev.to/refact/implementing-rag-in-refactai-ai-coding-assistant-35fg</guid>
      <description>&lt;p&gt;Retrieval Augmented Generation (RAG) is a technique to generate more customized and accurate AI suggestions by using the entire coding environment as a source of relevant context for code completions and chat queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this blog post, I go through all the details about how it works and how we implemented RAG in &lt;a href="https://refact.ai" rel="noopener noreferrer"&gt;Refact.ai&lt;/a&gt; — an open-source AI coding assistant for IDEs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7thqxdtdptz1r8kh1z98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7thqxdtdptz1r8kh1z98.png" alt="RAG in Refact.ai" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction: Why RAG matters in AI coding
&lt;/h2&gt;


&lt;p&gt;Imagine you have 2 files:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowsjvldra8ulmjiagzuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowsjvldra8ulmjiagzuq.png" alt="Like these" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you just have &lt;code&gt;my_file.py&lt;/code&gt; supplied to the model, it doesn’t have any way to know to complete &lt;code&gt;“say_hello”&lt;/code&gt; and that it needs a parameter to that function.&lt;/p&gt;
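&lt;p&gt;A hypothetical pair of files illustrating the same problem (stand-ins, not the exact code from the screenshot):&lt;/p&gt;

```python
# utils.py (hypothetical stand-in for the first file)
def say_hello(name):
    return "Hello, " + name + "!"

# my_file.py — if the model only sees this file, it has no way to know
# that say_hello exists in utils.py, or that it takes a `name` parameter.
greeting = say_hello("world")
```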

&lt;p&gt;This problem of the limited scope of AI models gets much worse as your project grows. So how do you feed the model the necessary information from your codebase, in real time and accurately?&lt;/p&gt;

&lt;p&gt;It takes a specialized RAG pipeline to do this work inside your IDE, and that’s the point of our new release.&lt;/p&gt;


&lt;h2&gt;
  
  
  Refact.ai Technology Stack
&lt;/h2&gt;


&lt;p&gt;We use an intermediate layer between a plugin inside IDE and the AI model called refact-lsp. And yes, it works as an &lt;a href="https://en.wikipedia.org/wiki/Language_Server_Protocol" rel="noopener noreferrer"&gt;LSP&lt;/a&gt; server, too.&lt;/p&gt;

&lt;p&gt;Its purpose is to run on your computer, keep track of all the files in your project, index them into an in-memory database, and forward requests for code completion or chat to the AI model, along with the relevant context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx13mv163pdv3ns4lznxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx13mv163pdv3ns4lznxe.png" alt="Refact.ai Tech Stack" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;refact-lsp&lt;/code&gt; is written in Rust, combining the speed of a compiled language with its safety guarantees. Rust is great: it has a library for almost any topic you can imagine, including vector databases and a port of tree-sitter — a library to parse source files in many programming languages.&lt;/p&gt;

&lt;p&gt;The amazing thing about it: &lt;code&gt;refact-lsp&lt;/code&gt; compiles into a single executable file that doesn’t require any other software to be installed on your computer — it’s self-sufficient! It means it will not interfere with whatever you are doing on your computer, and it will not break as you update your environment. In fact you don’t even see it, it gets installed together with the Refact.ai plugin in your favorite IDE.&lt;/p&gt;


&lt;h2&gt;
  
  
  AST and VecDB
&lt;/h2&gt;


&lt;p&gt;There are two kinds of indexing possible: based on Abstract Syntax Tree (AST) and based on Vector Database (VecDB).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What is AST?&lt;/strong&gt; We use tree-sitter to parse the source files, and then get the positions of function definitions, classes, etc. It is therefore possible to build an index in memory — a mapping from the name of a thing to its position — and make functions like “go to definition” and “find references” very fast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What is VecDB?&lt;/strong&gt; There are AI models that convert a piece of text (typically up to 1024 tokens) into a vector (typically 768 floating point numbers). All the documents get split into pieces and vectorized, vectors stored in a VecDB. These AI models are trained in such a way that if you vectorize the search query, the closest matches (in a sense of l2 metric between the vectors) in the database will be semantically similar or relevant to the query.&lt;br&gt;
The problem with VecDB is that you need to vectorize the query as well, and that might take some time — not good for code completion that needs to be real-time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
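&lt;p&gt;The AST index is essentially a map from a symbol’s name to its position. A minimal sketch of the idea, using Python’s stdlib &lt;code&gt;ast&lt;/code&gt; module as a stand-in (refact-lsp actually does this in Rust with tree-sitter, across many languages):&lt;/p&gt;

```python
import ast

def build_definition_index(source):
    """Map each function/class name to its (line, column) position."""
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            index[node.name] = (node.lineno, node.col_offset)
    return index

source = """
class Frog:
    def jump(self):
        pass
"""
index = build_definition_index(source)
```

With the index in memory, “go to definition” becomes a dictionary lookup instead of a scan of the whole project.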

&lt;p&gt;It’s not an issue for a chat though: here you can play with both indexes using &lt;code&gt;@-commands&lt;/code&gt;. More about it is described a few sections down.&lt;/p&gt;
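&lt;p&gt;The VecDB matching described above boils down to nearest-neighbor search under the L2 metric. A toy sketch (the snippets and 3-dimensional “embeddings” are made up for illustration; real models produce vectors of ~768 floats):&lt;/p&gt;

```python
import math

# A tiny in-memory "vector database": document text mapped to its vector.
store = {
    "def jump(self): ...":   [0.9, 0.1, 0.0],
    "def render(self): ...": [0.1, 0.9, 0.1],
}

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_vector, k=1):
    """Return the k documents whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda doc: l2(store[doc], query_vector))
    return ranked[:k]

best = search([0.8, 0.2, 0.0])  # a query vector near the "jump" snippet
```

The query must be vectorized with the same embedding model as the documents, which is exactly the step that adds latency.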


&lt;h2&gt;
  
  
  VecDB: Splitting the Source Code Right
&lt;/h2&gt;


&lt;p&gt;To vectorize a piece of text, we first need to make sure it’s a complete construct in a programming language, such as a single function, a single class. This way the semantic matching offered by VecDB will work best.&lt;/p&gt;

&lt;p&gt;The easiest way to implement this is to use empty lines as a hint for the boundaries to split:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frctwa5df5lvd3veflq88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frctwa5df5lvd3veflq88.png" alt="VecDB in Refact.ai" width="800" height="727"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see in this example that the functions are separated by an empty line. We in fact use this method for text files without an available tree-sitter module.&lt;/p&gt;
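&lt;p&gt;A minimal sketch of this fallback splitter (names are illustrative, not refact-lsp’s actual API):&lt;/p&gt;

```python
def split_on_blank_lines(text):
    """Chunk a file on blank lines, as for files without a tree-sitter module."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks

code = "def f():\n    return 1\n\ndef g():\n    return 2\n"
chunks = split_on_blank_lines(code)  # two chunks, one per function
```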

&lt;p&gt;But can we do better in splitting as well? Sure, of course we can! We can simplify the class by shortening function bodies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wvg54ykrd7qmnawit2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wvg54ykrd7qmnawit2s.png" alt="VecDB in Refact.ai" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this skeletonized version of the class gets vectorized, it’s much easier to match against a query like “classes that have jump in them”, compared to when the splitter vectorizes just the “jump” function without its class.&lt;/p&gt;
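&lt;p&gt;A rough sketch of this skeletonization: keep the class and method signatures, replace the bodies with &lt;code&gt;...&lt;/code&gt;. (Refact.ai does this in Rust via tree-sitter; Python’s stdlib &lt;code&gt;ast&lt;/code&gt; module is used here only for illustration.)&lt;/p&gt;

```python
import ast

def skeletonize(source):
    """Replace every function body with `...`, keeping the structure."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(ast.Constant(value=...))]
    return ast.unparse(tree)

source = """
class Frog:
    def jump(self):
        self.y += 1
        return self.y
"""
skeleton = skeletonize(source)  # class + signature survive, body becomes ...
```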


&lt;h2&gt;
  
  
  AST: Simple Tricks to Make It Better
&lt;/h2&gt;


&lt;p&gt;A library like &lt;code&gt;tree-sitter&lt;/code&gt; can transform the source code into individual elements: function definitions, function calls, classes, etc. The most useful case: match types and function calls near the cursor with definitions.&lt;/p&gt;

&lt;p&gt;See how it works:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://refact.ai/images/blog/meet-rag-in-refact-ai-technical-details/fim_.mp4" rel="noopener noreferrer"&gt;
      refact.ai
    &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;If you click on the “FIM” (fill-in-the-middle) button, you can see these in the sidebar with a 🔎 icon.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But besides this simple matching, there are some tricks we can do. For the symbols near the cursor, we can first look at their type, and then go to the definitions. And for classes, we can go to their parent class. Those are simple rules that work for all programming languages!&lt;/p&gt;

&lt;p&gt;Finally, treating all the identifiers as just strings, we can find similar pieces of code: they should contain similar identifiers. Similar code can help the model generate a good answer as well.&lt;/p&gt;
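&lt;p&gt;One hedged way to sketch “identifiers as just strings”: extract identifier sets with a regex and compare them with Jaccard similarity (the function names and snippets below are made up; this is not refact-lsp’s actual implementation):&lt;/p&gt;

```python
import re

def identifiers(code):
    """All identifier-shaped tokens in a snippet, as a set of strings."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))

def similarity(a, b):
    """Jaccard similarity of the two snippets' identifier sets."""
    ids_a, ids_b = identifiers(a), identifiers(b)
    return len(ids_a & ids_b) / len(ids_a | ids_b)

snippet = "frog.jump(height)"
candidate_similar = "def jump(self, height): self.y += height"
candidate_other = "render_page(template, request)"
```

The snippet sharing `jump` and `height` scores higher than the unrelated one, so it gets pulled into the context.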


&lt;h2&gt;
  
  
  Post Processing
&lt;/h2&gt;


&lt;p&gt;Let’s say you’ve found in the AST and VecDB indexes 50 interesting points that might help the model to do its job. Now you have additional issues to solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There might be just too many results to fit into the AI model’s context. There’s a budget measured in tokens to fit memory requirements, or latency requirements (code completion is real time), or model limitation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The results themselves might not make much sense without at least a little bit of structure where it appears. For example, for a “function do_the_thing()” it’s important to show it’s inside a class, and which class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There can be overlapping or duplicate results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those problems can be solved with good post-processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is how our post-processing works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It loads all files mentioned in the search results into memory, and it keeps track of the “usefulness” of each line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each result from AST or VecDB now just makes an increase in the usefulness of the lines it refers to. For example, if “my_function” is found, all the lines that define my_function will increase in usefulness, and the lines that contain the signature of the function (name, parameters and return type) will increase in usefulness more, compared to its body.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All that is left to do is to sort all the lines by usefulness, then take the most useful until the token budget is reached, and output the result: &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
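&lt;p&gt;The three steps can be sketched like this (a toy version: names, weights, and the 1-line-per-token budget are illustrative, not refact-lsp’s actual API):&lt;/p&gt;

```python
def compress(lines, results, budget):
    """Keep the most useful lines within a budget; elide the rest."""
    usefulness = [0.0] * len(lines)
    for line_numbers, weight in results:       # each hit bumps its lines
        for i in line_numbers:
            usefulness[i] += weight
    # sort line indices by usefulness, keep as many as the budget allows
    ranked = sorted(range(len(lines)), key=lambda i: -usefulness[i])
    kept = set(ranked[:budget])
    out, skipping = [], False
    for i, line in enumerate(lines):
        if i in kept:
            out.append(line)
            skipping = False
        elif not skipping:                     # collapse dropped runs to "..."
            out.append("...")
            skipping = True
    return out

lines = ["class Frog:", "    def jump(self):", "        self.y += 1", "        return self.y"]
# one hit on the signature (lines 0-1, weighted higher) and one on the body
compressed = compress(lines, [((0, 1), 2.0), ((2, 3), 1.0)], budget=2)
```

With a budget of 2, the class header and signature survive while the body collapses to an ellipsis — the skeletonization effect described below the video.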


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://refact.ai/images/blog/meet-rag-in-refact-ai-technical-details/frog.mp4" rel="noopener noreferrer"&gt;
      refact.ai
    &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;As you can see, post-processing can fit any token budget, keeping the most useful lines and replacing the less useful ones with an ellipsis.&lt;/p&gt;

&lt;p&gt;One interesting effect is skeletonization of the code: as the budget decreases, fewer and fewer lines make it into the context, and our post-processing prefers to keep some of the code structure (which class a function belongs to) over the body of that function.&lt;/p&gt;


&lt;h2&gt;
  
  
  Oh Look, It’s Similar to grep-ast!
&lt;/h2&gt;


&lt;p&gt;Yes you are right, it is! In fact we took inspiration from grep-ast, a small utility that uses tree-sitter to look for a string in a directory, and it also prefers to keep the structure of code so you can see where logically in the code your results are.&lt;/p&gt;

&lt;p&gt;It doesn’t have a notion of a token budget though, it’s written in Python so it’s not very fast, and it has no indexes to speed up search.&lt;/p&gt;


&lt;h2&gt;
  
  
  RAG in Refact.ai Chat with @-commands
&lt;/h2&gt;


&lt;p&gt;In Refact.ai, we’ve built RAG support into chat LLMs too. It can be used with @-commands to add important context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;@workspace&lt;/code&gt; - Uses VecDB to look for any query. You can give it a query on the same line like this: &lt;code&gt;“@workspace Frog definition”&lt;/code&gt;, or it will take any lines below it as a query, so you can search for multi-line things like code pieces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;@definition&lt;/code&gt; - Looks up the definition of a symbol. For example, you can ask: &lt;code&gt;“@definition MyClass”&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;@references&lt;/code&gt; - Same, but it returns references. Example: &lt;code&gt;“@references MyClass”&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;@file&lt;/code&gt; - Attaches a file. You can use file_name:LINE1-LINE2 notation for large files to be more specific, for example &lt;code&gt;“@file large_file.ext:42”&lt;/code&gt; or &lt;code&gt;“@file large_file.ext:42-56”&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;@symbols-at&lt;/code&gt; - Looks up any symbols near a given line in a file, and adds the results to the chat context. Uses the same procedure as code completion does. For it to work, you need to specify file and line number: &lt;code&gt;“@symbols-at some_file.ext:42”&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
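&lt;p&gt;As a side note, the &lt;code&gt;file_name:LINE1-LINE2&lt;/code&gt; notation is easy to handle in code. Here is a hypothetical parser for it (an illustration only; &lt;code&gt;parseFileRef&lt;/code&gt; is not part of Refact.ai):&lt;/p&gt;

```typescript
// Hypothetical parser for the file_name:LINE1-LINE2 notation used by
// @file and @symbols-at. Illustration only, not the real implementation.
type FileRef = { path: string; line1?: number; line2?: number };

function parseFileRef(arg: string): FileRef {
  // Match an optional trailing ":LINE" or ":LINE1-LINE2" suffix.
  const m = arg.match(/^(.+):(\d+)(?:-(\d+))?$/);
  if (m === null) return { path: arg };
  return {
    path: m[1],
    line1: Number(m[2]),
    line2: m[3] ? Number(m[3]) : undefined,
  };
}
```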

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkokkhrr2l4gw8hqhkna5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkokkhrr2l4gw8hqhkna5.png" alt="@definition" width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you start a new chat, there are options available:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Search workspace” is equivalent to typing &lt;code&gt;@workspace&lt;/code&gt; in the input field: it will use your question as a search query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Attach current_file.ext” is equivalent to the &lt;code&gt;“@file current_file.ext:CURSOR_LINE”&lt;/code&gt; command: it attaches the file and uses the current cursor position (CURSOR_LINE) to deal with potentially large files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Lookup symbols” extracts any symbols around the cursor and searches for them in the AST index. It’s equivalent to &lt;code&gt;“@symbols-at current_file.ext:CURSOR_LINE”&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Selected N lines” adds the current selection as a snippet for the model to analyze or modify; it’s equivalent to wrapping a piece of code in backticks: &lt;code&gt;my code&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Interesting Things You Can Try with RAG
&lt;/h2&gt;


&lt;p&gt;&lt;strong&gt;Summarize a File&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take a large file, open the chat (Option+C) or the toolbox (F1), and type “summarize in 1 paragraph”. The post-processing described above makes the file fit the chat context you have available. Check out how the file looks in the tooltip for the 📎 Attached file: the bigger the original file, the more skeletonized the version you’ll see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdedzcp0kv3rcxj9saq5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdedzcp0kv3rcxj9saq5l.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarize Interaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unfamiliar code is a big problem for humans: it might take hours to understand how several classes interact. Here’s another way to do it: use &lt;code&gt;@definition&lt;/code&gt; or &lt;code&gt;@file&lt;/code&gt; to put the classes of interest into the context, and ask the chat how they interact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91rnbctlp2siy475o3yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91rnbctlp2siy475o3yc.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Near Cursor with Context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can add context to the chat using the same procedure as code completion: use “Lookup symbols near cursor” or the &lt;code&gt;@symbols-at&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpktjaw32sqzeeegptuxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpktjaw32sqzeeegptuxk.png" alt=" " width="800" height="932"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  So How Good Is It?
&lt;/h2&gt;


&lt;p&gt;We tested code completion models with and without RAG, and here are the results.&lt;/p&gt;

&lt;p&gt;We’ve made a small test that is easy to understand and interpret. It works like this: take 100 random repositories from GitHub for each programming language, delete a random line in a random function, and run single-line code completion to restore it exactly.&lt;/p&gt;

&lt;p&gt;It is not perfect: sometimes it’s hard to reproduce comments exactly (if there are any on the deleted line), and there are many easy cases (like a closing bracket) that won’t benefit from RAG at all. But it’s still a good test, because it’s simple! &lt;a href="https://huggingface.co/datasets/smallcloudai/refact_repobench" rel="noopener noreferrer"&gt;Here is the dataset&lt;/a&gt;.&lt;/p&gt;
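&lt;p&gt;The harness for such a test can be sketched in a few lines. This is a simplified illustration, not our actual benchmark code; &lt;code&gt;exactMatchScore&lt;/code&gt; and its injected helpers are made up for this example:&lt;/p&gt;

```typescript
// Sketch of the single-line restoration benchmark described above.
// `complete` stands in for the completion model (with or without RAG);
// `pickLine` chooses which line to delete. Both are injected so the
// harness stays model-agnostic.
type CompleteFn = (prefix: string, suffix: string) => string;

function exactMatchScore(
  files: string[],
  complete: CompleteFn,
  pickLine: (lineCount: number) => number,
): number {
  let hits = 0;
  for (const source of files) {
    const lines = source.split("\n");
    const i = pickLine(lines.length);            // line to delete
    const prefix = lines.slice(0, i).join("\n"); // code above the hole
    const suffix = lines.slice(i + 1).join("\n");// code below the hole
    if (complete(prefix, suffix).trim() === lines[i].trim()) hits += 1;
  }
  return hits / files.length; // fraction of lines restored exactly
}
```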

&lt;p&gt;&lt;strong&gt;The results of the test (running StarCoder2/3b):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bsf3ntmkern6li913w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bsf3ntmkern6li913w8.png" alt=" " width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;RAG always helps!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It helps Java more than Python, because many Python projects don’t use type hints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Here it is! We believe we've developed the best in-IDE RAG for code completion and chat, at least among the open-source solutions we've explored.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to ask them in the comments. I'll be glad to answer or have a discussion. Or welcome to &lt;a href="https://www.smallcloud.ai/discord" rel="noopener noreferrer"&gt;join our Discord&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;P.S. You can try &lt;a href="https://refact.ai/pricing/" rel="noopener noreferrer"&gt;RAG for coding in Refact.ai Pro plan&lt;/a&gt; — use promo code RAGROCKS for a 2-month free trial. Have fun!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>opensource</category>
      <category>development</category>
    </item>
    <item>
      <title>📝 🚀 Creating our first documentation from scratch using Astro and Refact AI coding assistant</title>
      <dc:creator>Vadim Smirnov</dc:creator>
      <pubDate>Mon, 18 Sep 2023 17:36:16 +0000</pubDate>
      <link>https://dev.to/refact/creating-our-first-documentation-from-scratch-using-astro-and-refact-ai-coding-assistant-36pg</link>
      <guid>https://dev.to/refact/creating-our-first-documentation-from-scratch-using-astro-and-refact-ai-coding-assistant-36pg</guid>
      <description>&lt;p&gt;At Refact, we put performance and developer experience as our top priority. That's why we were looking for a lightweight and highly customizable template for our documentation.&lt;/p&gt;

&lt;p&gt;Previously, we used Astro for our &lt;a href="http://refact.ai/" rel="noopener noreferrer"&gt;refact.ai&lt;/a&gt; website and wanted to stay within the Astro ecosystem for the documentation.&lt;/p&gt;

&lt;p&gt;Let's see how easy it will be to build the docs using the Starlight template from Astro with the help of the Refact AI coding assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Refact: Open-source AI coding assistant&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Just a quick background about us: &lt;a href="https://github.com/smallcloudai/refact" rel="noopener noreferrer"&gt;Refact&lt;/a&gt; is an open-source AI coding assistant that can significantly boost developer productivity with tools like code completion, code refactoring, and chat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr6c4b2tq1vunbit64ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr6c4b2tq1vunbit64ag.png" alt="Refact AI Repo"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Starlight docs
&lt;/h2&gt;

&lt;p&gt;Starlight is a framework-agnostic Astro template for building documentation websites. With a straightforward configuration process and the support of Markdown and MDX, you get a solid foundation for your documentation.&lt;/p&gt;

&lt;p&gt;In this article, we will follow the manual setup process to get a better understanding of all of the parts that are required for Starlight to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating an Astro project
&lt;/h3&gt;

&lt;p&gt;First, we must create an Astro project as a foundation since Starlight is built on top of Astro.&lt;/p&gt;

&lt;p&gt;To do that, run the following command in your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

npm create astro@latest


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can pick the option with the sample files.&lt;/p&gt;

&lt;p&gt;Once the setup process is completed, you can run &lt;code&gt;npm run dev&lt;/code&gt; in your terminal and see a starter template in your browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oupq7zbotavkcvpp596.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oupq7zbotavkcvpp596.png" alt="Astro starter project"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Astro, you get an extremely straightforward structure of the project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

/
├── public/
│   └── favicon.svg
├── src/
│   ├── components/
│   │   └── Card.astro
│   ├── layouts/
│   │   └── Layout.astro
│   └── pages/
│       └── index.astro
└── astro.config.mjs


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Adding Starlight
&lt;/h3&gt;

&lt;p&gt;Now that we have the foundation, we are ready to add Starlight to turn our website into a fully functional documentation site.&lt;/p&gt;

&lt;p&gt;To do that, run the following command in your terminal:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

npx astro add starlight


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After completing the multistep installation process in the terminal, we can see that our Astro configuration file has been updated.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;import { defineConfig } from 'astro/config';
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+ import starlight from "@astrojs/starlight";
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;export default defineConfig({
&lt;/span&gt;&lt;span class="gi"&gt;+   integrations: [starlight()]
&lt;/span&gt;});
&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Perfect! Now, we have an Astro project with a Starlight integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring Starlight
&lt;/h3&gt;

&lt;p&gt;Even though we installed the Starlight integration, we need to complete three more steps to see it working:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure the plugin&lt;/li&gt;
&lt;li&gt;Configure the content collection&lt;/li&gt;
&lt;li&gt;Add markdown files to see the complete documentation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process is very straightforward, and Refact helps speed it up with its autocomplete functionality.&lt;/p&gt;

&lt;p&gt;First, let's configure the plugin. Starlight allows you to set many things in the configuration, but only one property is required for the plugin to function, which is a &lt;code&gt;title&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In your &lt;code&gt;astro.config.mjs&lt;/code&gt; file, you should add a &lt;code&gt;title&lt;/code&gt; inside the Starlight integration. Refact immediately starts to help you with that!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;import { defineConfig } from 'astro/config';
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;import starlight from "@astrojs/starlight";
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;export default defineConfig({
&lt;/span&gt;    integrations: [starlight(
&lt;span class="gi"&gt;+       {
+           title: "My awesome documentation"
+       }
&lt;/span&gt;    )]
});
&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The next step is to configure the content collection. The content collection allows us to handle the dynamic generation of pages from markdown files.&lt;/p&gt;

&lt;p&gt;To configure the content collection, we need to complete the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a folder structure for the documentation&lt;/li&gt;
&lt;li&gt;Create the config file and define the collection with the docs schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For documentation to function, we need to create a &lt;code&gt;content&lt;/code&gt; folder inside the &lt;code&gt;src&lt;/code&gt; directory, which will be the root directory for the documentation-specific files.&lt;/p&gt;

&lt;p&gt;Inside that folder, we need to create the &lt;code&gt;config.ts&lt;/code&gt; file to define the collection.&lt;/p&gt;

&lt;p&gt;First, let's import the two functions we will use: &lt;code&gt;defineCollection&lt;/code&gt; and &lt;code&gt;docsSchema&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="gi"&gt;+ import { defineCollection } from 'astro:content'
+ import { docsSchema } from '@astrojs/starlight/schema'
&lt;/span&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we are ready to define the collection. Refact will pick it up and help you complete that logic, which speeds up the process and guides you on what needs to be done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaHBwemJlNTFlYWhjbTZucDg2bHZ5dXp5cHlqdzF6cW50ZHBtMW1zYiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/1sIGSvAjJMxgIGwT3k/giphy-downsized-large.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaHBwemJlNTFlYWhjbTZucDg2bHZ5dXp5cHlqdzF6cW50ZHBtMW1zYiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/1sIGSvAjJMxgIGwT3k/giphy-downsized-large.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the configuration process is completed, let's add content to see the documentation in action.&lt;/p&gt;

&lt;p&gt;Inside the &lt;code&gt;content&lt;/code&gt; folder, we need to add the &lt;code&gt;docs&lt;/code&gt; directory. There, we should add an &lt;code&gt;index.md&lt;/code&gt; file with the following structure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

---
title: "My Documentation"
description: "This is a getting started guide of my documentation."
---

This is the first segment of my documentation!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If we start the project locally, we will see our beautiful documentation functioning as expected!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7uybc44vz6y3laey0p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7uybc44vz6y3laey0p8.png" alt="Starlight Docs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Customizing your documentation
&lt;/h2&gt;

&lt;p&gt;The result that we achieved is great, but it looks very generic. That's why Starlight allows you to customize the documentation and make it more personalized.&lt;/p&gt;

&lt;p&gt;Let's modify the frontmatter in &lt;code&gt;index.md&lt;/code&gt; file to turn it into a landing page of your documentation. Plenty of options are available for you to make a great-looking landing page. At Refact, we used the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;template&lt;/code&gt; - allows you to turn a regular documentation page into a landing page&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hero&lt;/code&gt; - an option with &lt;code&gt;tagline&lt;/code&gt;, &lt;code&gt;image&lt;/code&gt;, and &lt;code&gt;actions&lt;/code&gt; properties to generate a hero banner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to that, Starlight comes with components that allow you to add richer content to the landing page.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't forget to rename your &lt;code&gt;index.md&lt;/code&gt; to &lt;code&gt;index.mdx&lt;/code&gt; for components to work!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a final result, here's how the code for the landing page might look:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

---
title: Welcome to Refact Documentation!
description: Refact is a powerful AI coding assistant that combines completion, refactoring, chat, and more.
template: splash
hero:
  tagline: Refact is a powerful AI coding assistant that combines completion, refactoring, chat, and more.
  image:
    file: ../../assets/refact.png
  actions:
    - text: Get started
      link: /getting-started/what-is-refact/
      icon: right-arrow
      variant: primary
    - text: VS Code Extension
      link: https://marketplace.visualstudio.com/items?itemName=smallcloud.codify
    - text: Jetbrains Plugin
      link: https://plugins.jetbrains.com/plugin/20647-codify
---

import { Card, CardGrid } from '@astrojs/starlight/components';

## Features

&amp;lt;CardGrid stagger&amp;gt;
    &amp;lt;Card title="Code Completion" icon="pencil"&amp;gt;
        As you write code, Refact suggests potential code completions based on the context of your code,
    looking up and down. It can suggest whole functions, classes, commonly used programming patterns,
    libraries, and APIs usage.
    &amp;lt;/Card&amp;gt;
    &amp;lt;Card title="Improve code" icon="magnifier"&amp;gt;
        Refact can identify code that could be refactored to be more efficient or easier to understand.
    It can also detect bugs in your code and generate patches to fix them.
    &amp;lt;/Card&amp;gt;
    &amp;lt;Card title="AI Chat" icon="rocket"&amp;gt;
        Use plain language prompts in Refact chat to ask questions or get help with
    writing code without leaving your IDE.
    &amp;lt;/Card&amp;gt;
    &amp;lt;Card title="Transform and analyze code" icon="setting"&amp;gt;
        Refact can analyze the complexity of your code and explain unclear lines of code.
    It can also transform your code into a different language.
    &amp;lt;/Card&amp;gt;
&amp;lt;/CardGrid&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code above generates a great-looking landing page with lots of helpful content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bo8cd5fk3bi83r24sge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bo8cd5fk3bi83r24sge.png" alt="Starlight docs with custom landing page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Changing the color theme
&lt;/h3&gt;

&lt;p&gt;As you saw in the screenshot, the color theme we use for the Refact documentation differs from the one available out of the box. Starlight makes it very straightforward to change the color scheme.&lt;/p&gt;

&lt;p&gt;Starlight allows you to add custom styles to match the documentation colors with colors that represent your brand.&lt;/p&gt;

&lt;p&gt;First, create a &lt;code&gt;custom.css&lt;/code&gt; file in the &lt;code&gt;src/styles&lt;/code&gt; folder. After that, link that file in the &lt;code&gt;astro.config.mjs&lt;/code&gt; file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;import { defineConfig } from 'astro/config';
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;import starlight from "@astrojs/starlight";
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;export default defineConfig({
&lt;/span&gt;    integrations: [starlight(
        {
            title: "My awesome documentation",
&lt;span class="gi"&gt;+           customCss: [
+               './src/styles/custom.css'
+           ]
&lt;/span&gt;        }
    )
    ]
});
&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can modify the color theme in your newly created CSS file. Here is an example of styles that we use at Refact to match our brand colors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;

&lt;span class="c"&gt;/* Dark mode colors. */&lt;/span&gt;
&lt;span class="nd"&gt;:root&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent-low&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#42100e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#c70016&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent-high&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#f8b6ad&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-white&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#ffffff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#eeeeee&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#c2c2c2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#8b8b8b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#585858&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#383838&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#272727&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-black&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#181818&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;/* Light mode colors. */&lt;/span&gt;
&lt;span class="nd"&gt;:root&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;data-theme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;'light'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent-low&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#fcc9c2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#cb0017&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-accent-high&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#610b0c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-white&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#181818&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#272727&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#383838&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#585858&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#8b8b8b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#c2c2c2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#eeeeee&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-gray-7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#f6f6f6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="py"&gt;--sl-color-black&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#ffffff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you refresh the page, you can see that the color theme was updated!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And there we have it!&lt;br&gt;
We've explored the Starlight project - a fantastic solution for the documentation, and we used Refact - an AI coding assistant that immediately picked up our project's context and helped us get used to an unfamiliar code base a lot faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;With a solid foundation and an incredible AI ally, you can explore the Starlight project to see how to personalize and make your documentation stunning.&lt;br&gt;
In addition, I encourage you to try &lt;a href="https://refact.ai/" rel="noopener noreferrer"&gt;Refact&lt;/a&gt; and see how it will improve your productivity and help you become more efficient as an engineer.&lt;/p&gt;

&lt;p&gt;Join our &lt;a href="https://www.smallcloud.ai/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; server if you need assistance from the Refact team and community to get started faster!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>astro</category>
      <category>documentation</category>
    </item>
    <item>
      <title>Open-source Fine-Tuning on Codebase with Refact</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Tue, 05 Sep 2023 10:18:54 +0000</pubDate>
      <link>https://dev.to/refact/open-source-fine-tuning-on-codebase-with-refact-3po1</link>
      <guid>https://dev.to/refact/open-source-fine-tuning-on-codebase-with-refact-3po1</guid>
      <description>&lt;p&gt;Code completion has become increasingly popular, thanks to tools like GitHub Copilot and open-source Large Language Models (LLMs). However, both Copilot and open models often fall short when it comes to working effectively on your specific codebase. This is because these models have never been exposed to your unique code patterns and conventions.&lt;br&gt;
In order to improve the quality of suggestions and tailor them to your codebase there's a technique called fine-tuning. By fine-tuning a pre-trained model on your codebase, you can improve its ability to understand and generate code that aligns with your requirements.&lt;br&gt;
In this blog post, we will delve into the concept of fine-tuning and its technical details, and show how you can start self-hosting your fine-tuned model in Refact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;


&lt;p&gt;In this video, the same simple function is generated by Copilot, the base Refact 3b model, and the fine-tuned Refact 3b model.&lt;br&gt;
All three can look down the code, find which variables are necessary, and help you with typing, but only the fine-tuned version knows how to work with DatasetOpts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Exactly Does Fine-tuning Work?
&lt;/h2&gt;

&lt;p&gt;Large language models work by predicting the next token. This simple objective lets LLMs learn syntax, code patterns, and even high-level concepts.&lt;br&gt;
The code you write is probably different from every other project on the internet. It might be similar -- that's why code LLMs are already useful -- but you probably have your own established way of doing things.&lt;br&gt;
One simple example is coding style: how a model predicts the next token defines how it writes code, including variable names, whitespace, etc.&lt;br&gt;
Fine-tuning has the same objective as pre-training: predict the next token. By adjusting the parameters in a clever way (it needs only one GPU to train!), the model starts to predict the next token according to your coding style, your patterns, your typical API usage, etc.&lt;br&gt;
That's why you'll see more useful suggestions with a fine-tuned model.&lt;/p&gt;
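&lt;p&gt;As a toy illustration of that objective (not Refact's actual training code), the quantity being minimized is the average cross-entropy between the model's predicted next-token distribution and the token that actually follows. The probabilities and token names below are made up:&lt;/p&gt;

```python
import math

def next_token_loss(predictions, actual_tokens):
    # Average cross-entropy: -log(probability the model assigned
    # to the token that actually came next).
    total = 0.0
    for dist, target in zip(predictions, actual_tokens):
        total += -math.log(dist.get(target, 1e-12))
    return total / len(actual_tokens)

# A model familiar with your codebase puts high probability
# on the tokens you would actually type next:
predictions = [
    {"self": 0.9, "cls": 0.1},    # e.g. after "def method("
    {"opts": 0.7, "args": 0.3},
]
actual_tokens = ["self", "opts"]
loss = next_token_loss(predictions, actual_tokens)
```

&lt;p&gt;Fine-tuning nudges these predicted distributions toward your codebase, which is exactly what lowers the loss discussed in the Test Loss section below.&lt;/p&gt;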

&lt;h2&gt;
  
  
  What Data Can I Use for Fine-tuning the Model?
&lt;/h2&gt;

&lt;p&gt;In the Refact UI, you upload source code as an archive (.zip, .tar.gz, .bz2) or give it a link to a git repository (private repositories work too, though you need to generate an ssh key). You can also upload an individual file. Refact will then slice your source code into pieces that a model can actually train on.&lt;br&gt;
It's a good idea to give the model the current code of your projects. However, it's NOT a good idea to feed it the 3rd-party libraries you use, as the model may learn to generate code similar to the internals of those libraries.&lt;/p&gt;
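&lt;p&gt;The slicing step can be pictured as cutting each file into overlapping windows that fit the model's context. This is only a minimal sketch with made-up window sizes and whitespace "tokens" -- Refact's real pipeline uses a proper tokenizer:&lt;/p&gt;

```python
def slice_source(text, chunk_tokens=2048, overlap=128):
    # Stand-in for a real tokenizer: whitespace split.
    tokens = text.split()
    step = chunk_tokens - overlap
    # Overlapping windows, so context spanning a boundary
    # is still seen by the model at least once in full.
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), step)]

chunks = slice_source("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9",
                      chunk_tokens=4, overlap=1)
```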

&lt;h2&gt;
  
  
  Test Loss
&lt;/h2&gt;

&lt;p&gt;In order to measure how well the model adapts to your code, you can take one or two of your files and make them a test set. To be a meaningful measurement, these files should use your coding style, your libraries, and your APIs.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;img src="https://refact.ai/images/blog/refact-finetune/sources-code.png"&amp;gt;
&amp;lt;span&amp;gt;Picture: shows &amp;lt;code&amp;gt;vllm&amp;lt;/code&amp;gt; github repository as a training set, and a single file &amp;lt;code&amp;gt;benchmark_serving.py&amp;lt;/code&amp;gt; as a fixed test set&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If test files also appear in the train set, they are automatically subtracted from it.&lt;br&gt;
If you don't specify a test set, Refact picks several random files for you.&lt;/p&gt;
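&lt;p&gt;The subtraction rule above amounts to removing any file named in the test set from the training list, so the test loss measures generalization rather than memorization. A minimal sketch with hypothetical file names:&lt;/p&gt;

```python
def split_train_test(all_files, test_files):
    # Files chosen for the test set never appear in training;
    # otherwise the test loss would be meaningless.
    held_out = set(test_files)
    train = [f for f in all_files if f not in held_out]
    return train, sorted(held_out)

train, test = split_train_test(
    ["serve.py", "utils.py", "benchmark_serving.py"],
    ["benchmark_serving.py"],
)
```

&lt;p&gt;A lower loss on the held-out files after fine-tuning is the signal that the model has actually adapted to your code.&lt;/p&gt;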

&lt;h2&gt;
  
  
  Technical Details
&lt;/h2&gt;

&lt;p&gt;It's possible to fine-tune all parameters (a "full fine-tune"), but PEFT (Parameter-Efficient Fine-Tuning) methods have recently become popular. Several methods are available; the most popular so far is LoRA (&lt;a href="https://arxiv.org/abs/2106.09685"&gt;2106.09685&lt;/a&gt;), which can train less than 1% of the original weights.&lt;br&gt;
LoRA has one important parameter -- the adapter tensor size, called &lt;code&gt;lora_r&lt;/code&gt;. It defines how much information LoRA can add to the network. If your codebase is small, the fine-tuning process will see the same data over and over, many times in a loop. We found that small LoRA tensors work best for a smaller codebase because they don't overfit as much -- the tensors simply lack the capacity to fit the limited training set exactly.&lt;br&gt;
As the codebase gets bigger, the tensors should grow as well. We also unfreeze token embeddings at a certain codebase size.&lt;br&gt;
To pick all the parameters automatically, we developed a heuristic that calculates a score based on the source files it sees. The score is then used to determine the appropriate LoRA size, the number of fine-tuning steps, and other parameters. We have tested this heuristic with several beta-test clients, on small codebases of a few files, and on large codebases like the Linux kernel (about 50,000 useful source files).&lt;br&gt;
If the heuristic doesn't work for you for whatever reason, you can set all the parameters yourself.&lt;/p&gt;
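&lt;p&gt;For intuition, here is the low-rank update in plain Python -- a sketch of the LoRA idea, not Refact's implementation. The frozen weight W is augmented by the product of two small trained matrices, and &lt;code&gt;lora_r&lt;/code&gt; is their shared inner dimension:&lt;/p&gt;

```python
def matmul(a, b):
    # Tiny dense matrix multiply so the sketch has no dependencies.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, lora_b, lora_a, alpha, r):
    # W_eff = W + (alpha / r) * (B @ A). W stays frozen; only
    # B (m x r) and A (r x n) are trained, and r is lora_r.
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# An r = 1 update applied to a 2x2 identity weight:
w_eff = lora_effective_weight(
    w=[[1.0, 0.0], [0.0, 1.0]],
    lora_b=[[1.0], [0.0]],   # 2 x r
    lora_a=[[0.0, 2.0]],     # r x 2
    alpha=1, r=1,
)
```

&lt;p&gt;For a 4096x4096 weight with &lt;code&gt;lora_r&lt;/code&gt; of 8, the trained matrices hold 8 * (4096 + 4096) = 65,536 values against roughly 16.8M frozen ones -- about 0.4%, which is where the "less than 1% of the original weights" figure comes from.&lt;/p&gt;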

&lt;h2&gt;
  
  
  How to Test If It Worked?
&lt;/h2&gt;

&lt;p&gt;After the fine-tuning process finishes (which should take several hours), you can dynamically turn it on and off and observe the difference it makes for code suggestions. You can do this using this switch:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;img src="https://refact.ai/images/blog/refact-finetune/lora-select.png"&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;There's a catch: both the VS Code and JetBrains plugins cache responses. To force the model to produce a new suggestion (rather than immediately replying with a cached one), change the text a few lines above -- a comment, for example.&lt;br&gt;
Alternatively, use the Manual Suggestion Trigger (a key combination), which always produces a new suggestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self Hosting
&lt;/h2&gt;

&lt;p&gt;You can use your own GPU to host and fine-tune LLMs with &lt;a href="https://github.com/smallcloudai/refact/"&gt;Refact self-hosting server&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Q: Maybe models can guess code better if they have more context, especially from other files?&lt;br&gt;
A: For the best results, you need both. Fine-tuning gives you the coding style, and if the model can also see relevant snippets from other files, it will do better at calling functions and using types defined outside the current file. We are working on that, too. Join our Discord server to be the first to know when we release it!&lt;br&gt;
Q: I only want to imitate the coding style of certain experts on my team. Is this possible?&lt;br&gt;
A: Certainly! Selectively upload the files that represent the desired coding style and exclude old or low-quality code, and the model will generate code that aligns with the chosen style. This can be a valuable way to transfer expert knowledge within your company, since the coding assistant will consistently suggest good practices.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>selfhost</category>
    </item>
    <item>
      <title>🤖We trained a small 1.6b code model that reaches 32% HumanEval🤖</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Tue, 05 Sep 2023 10:15:23 +0000</pubDate>
      <link>https://dev.to/refact/we-trained-16b-code-model-and-you-can-use-it-as-a-personal-copilot-in-refact-for-free-12io</link>
      <guid>https://dev.to/refact/we-trained-16b-code-model-and-you-can-use-it-as-a-personal-copilot-in-refact-for-free-12io</guid>
      <description>&lt;p&gt;Today we're introducing Refact LLM: 1.6B code model with infill real-time code completion (including fill-in-the-middle(FIM) capability) and chat.&lt;br&gt;
Refact LLM achieves the state-of-the-art performance among the code LLMs, coming closer to  HumanEval as Starcoder, being 10x smaller in size, and it beats other code models such as StableCode, CodeGen and ReplitCode on HumanEval metric. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;1.6b parameters&lt;/li&gt; 
    &lt;li&gt;20 programming languages&lt;/li&gt; 
    &lt;li&gt;4096 tokens context&lt;/li&gt; 
    &lt;li&gt;code completion and chat capabilities&lt;/li&gt; 
    &lt;li&gt;SoTA on HumanEval benchmark among similar code models&lt;/li&gt; 
    &lt;li&gt;pre-trained on permissive licensed code and available for commercial use&lt;/li&gt; 
&lt;/ul&gt;


    &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
        &lt;thead&gt;
            &lt;tr&gt;
                &lt;th&gt;Model&lt;/th&gt;
                &lt;th&gt;Model Size&lt;/th&gt;
                &lt;th&gt;HumanEval pass@1&lt;/th&gt;
            &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
            &lt;tr&gt;
                &lt;td&gt;DeciCoder-1b&lt;/td&gt;
                &lt;td&gt;1b&lt;/td&gt;
                &lt;td&gt;19.1%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;Refact-1.6-fim&lt;/td&gt;
                &lt;td&gt;1.6b&lt;/td&gt;
                &lt;td&gt;32.0%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;StableCode&lt;/td&gt;
                &lt;td&gt;3b&lt;/td&gt;
                &lt;td&gt;20.2%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;ReplitCode v1&lt;/td&gt;
                &lt;td&gt;3b&lt;/td&gt;
                &lt;td&gt;21.9%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;CodeGen2.5-multi&lt;/td&gt;
                &lt;td&gt;7b&lt;/td&gt;
                &lt;td&gt;28.4%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;CodeLlama&lt;/td&gt;
                &lt;td&gt;7b&lt;/td&gt;
                &lt;td&gt;33.5%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;StarCoder&lt;/td&gt;
                &lt;td&gt;15b&lt;/td&gt;
                &lt;td&gt;33.6%&lt;/td&gt;
            &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;&lt;/div&gt;


&lt;p&gt;The base model was trained on our own dataset of permissively licensed code and on open text datasets (a 50:50 text-to-code ratio). In total, we trained the base model on 1.2T tokens on our cluster.&lt;/p&gt;

&lt;p&gt;The model was then fine-tuned on open code instruction-following datasets filtered for quality and on a synthetic dataset based on &lt;a href="https://huggingface.co/datasets/bigcode/the-stack-dedup" rel="noopener noreferrer"&gt;The Stack dedup v1.1&lt;/a&gt; to improve FIM and boost the base model's performance. &lt;/p&gt;

&lt;p&gt;You can read more about the architecture decisions that we made in the &lt;a href="https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We aim for the model to be accessible to everyone, so we're releasing it for commercial use under the BigScience OpenRAIL-M license and making the weights available on &lt;a href="https://huggingface.co/smallcloudai/Refact-1_6B-fim" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;While the recent trend has been for model sizes to grow, we wanted to lower the barrier to entry and make the model a versatile tool for developers with varying hardware setups. With the smaller size, running the model is faster and more affordable than ever: it can be served on most modern GPUs, requires just 3Gb of memory, and works great for real-time code completion tasks.&lt;/p&gt;

&lt;p&gt;Refact LLM can be easily integrated into existing developer workflows with &lt;a href="https://github.com/smallcloudai/refact/" rel="noopener noreferrer"&gt;an open-source docker container&lt;/a&gt; and &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://plugins.jetbrains.com/plugin/20647-codify" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; plugins. With Refact's intuitive user interface, developers can easily use the model for a variety of coding tasks. Fine-tuning is available in the self-hosted (docker) and Enterprise versions, making suggestions more relevant to your private codebase.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Fintroducing-refact-code-llm%2Fpalindrome.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Fintroducing-refact-code-llm%2Fpalindrome.gif"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Refact 1.6B LLM is the third model in the family of our code models, with &lt;a href="https://huggingface.co/smallcloudai/codify_3b_multi" rel="noopener noreferrer"&gt;CodeContrast 3b&lt;/a&gt; and &lt;a href="https://huggingface.co/smallcloudai/codify_medium_multi" rel="noopener noreferrer"&gt;CodeContrast 0.3b&lt;/a&gt; released previously. We aim to continue with our research and future updates to improve the LLM's performance and capabilities. We would love to get community contributions and feedback to enhance the model further. For any questions and ideas, please visit our &lt;a href="https://smallcloud.ai/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>githubcopilot</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
