<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aidan Do</title>
    <description>The latest articles on DEV Community by Aidan Do (@aidando73).</description>
    <link>https://dev.to/aidando73</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2611592%2Fca836770-3db3-4b86-8a88-c5cffe65b223.jpeg</url>
      <title>DEV Community: Aidan Do</title>
      <link>https://dev.to/aidando73</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aidando73"/>
    <language>en</language>
    <item>
      <title>Building a Coding Agent from Scratch with Llama 70B: Lessons Learned</title>
      <dc:creator>Aidan Do</dc:creator>
      <pubDate>Sat, 04 Jan 2025 03:51:34 +0000</pubDate>
      <link>https://dev.to/aidando73/building-a-coding-agent-from-scratch-with-llama-70b-lessons-learned-48c4</link>
      <guid>https://dev.to/aidando73/building-a-coding-agent-from-scratch-with-llama-70b-lessons-learned-48c4</guid>
      <description>&lt;p&gt;I recently built a coding agent using Llama 3.3 70B (see &lt;a href="https://x.com/aidando73/status/1875455780784365864" rel="noopener noreferrer"&gt;demo on x&lt;/a&gt;). While my agent's performance on SWE Bench Lite (5%) is not that impressive compared to the leaders (48.33%), I'd like to share some of my learnings in case there's something here you didn't already know.&lt;/p&gt;

&lt;h2&gt;
  Why do this
&lt;/h2&gt;

&lt;p&gt;If you take a look at the &lt;a href="https://www.swebench.com/" rel="noopener noreferrer"&gt;SWE-bench Lite leaderboard&lt;/a&gt;, 80-90% of the entries use closed models. I want to see how far we can push open-source models, and I'm excited for the point where people use open-source models for real coding-assistant use cases.&lt;/p&gt;

&lt;h2&gt;
  Key Learnings
&lt;/h2&gt;

&lt;h3&gt;
  1. Keep It Simple
&lt;/h3&gt;

&lt;p&gt;Sticking to the simplest (sometimes brute-force) solutions gave me much more control and visibility into the agent's behavior. For example, I could quickly gather operational metrics with simple string searches over the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Syntax error: 46
File not found: 114
Total errors (post tool call): 122
Total tool calls: 670
list_files tool calls: 245
edit_file tool calls: 78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
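&lt;p&gt;As an illustration (not the agent's actual code, and with assumed log-marker strings), counts like these can be pulled out of a log with a few substring matches:&lt;/p&gt;

```python
# Hypothetical sketch: count agent events by substring-matching log lines.
# The log format and marker strings are assumptions, not the agent's actual ones.
from collections import Counter

MARKERS = {
    "Syntax error": "syntax_errors",
    "File not found": "file_not_found",
    "Tool call:": "total_tool_calls",
}

def count_events(log_lines):
    """Return a Counter of event names, one increment per matching marker."""
    counts = Counter()
    for line in log_lines:
        for marker, name in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts
```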



&lt;p&gt;I also left out the bash tool and Python execution at the start, because I saw on OpenHands that Llama 405B gets fairly lost when trying to execute arbitrary bash commands and Python scripts. So I began with just three tools, edit_file, list_files and view_file, and I was able to get to 5% with those alone.&lt;/p&gt;

&lt;p&gt;Anthropic says it pretty well here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/research/building-effective-agents" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  2. The Power of Raw Prompts
&lt;/h3&gt;

&lt;p&gt;Raw prompts seem intimidating at first, but they're pretty straightforward and give you far more power. For instance, none of the Llama chat-completion APIs I could find allow text content alongside tool calls. Looking under the hood, Meta's standard tool-call system prompt explicitly tells the model not to include any text alongside tool calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/meta-llama/llama-models/blob/675e4be3973f70a6441cc0302766a1669a99db1f/models/llama3/prompt_templates/system_prompts.py#L217" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/p&gt;
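&lt;p&gt;As a hedged sketch of what owning the parsing logic can buy you: the bracketed format above is valid Python syntax when the arguments are literals, so a lenient parser can lean on the &lt;code&gt;ast&lt;/code&gt; module and tolerate thinking text around the call block. The names and error handling here are illustrative, not the agent's actual parser:&lt;/p&gt;

```python
# Lenient parser for Meta's bracketed tool-call format, e.g.
#   [edit_file(path="a.py", old="x", new="y"), list_files(path=".")]
# Assumption: arguments are keyword-only literals. A production parser would
# need to handle nested brackets and quoting more carefully.
import ast

def parse_tool_calls(text):
    """Extract (name, kwargs) pairs, tolerating surrounding thinking text."""
    # Grab the outermost [...] span so text before/after the calls is ignored.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        return []
    tree = ast.parse(text[start:end + 1], mode="eval")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call):
            continue
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls
```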

&lt;p&gt;With raw prompts, not only could I get thinking text alongside tool calls, I also hit far fewer tool-call parsing errors, since I controlled the parsing logic and could accept calls that some implementations would have rejected.&lt;/p&gt;

&lt;p&gt;I suspect this is a big advantage open-source models have over closed models: you have access to the prompt formats. Ashwin, one of the core maintainers of Llama Stack, has also &lt;a href="https://gist.github.com/aidando73/943b5f02d35571eb783f3cca5afa6e59?permalink_comment_id=5332243#gistcomment-5332243" rel="noopener noreferrer"&gt;mentioned this point&lt;/a&gt;. Meta publishes its &lt;a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/prompt_format.md" rel="noopener noreferrer"&gt;prompt formats&lt;/a&gt;, and you can view &lt;a href="https://github.com/meta-llama/llama-stack" rel="noopener noreferrer"&gt;llama-stack&lt;/a&gt; for a reference implementation.&lt;/p&gt;

&lt;h3&gt;
  3. Less Context Can Be More
&lt;/h3&gt;

&lt;p&gt;More context isn't always better. Initially, I included the entire repository file tree in the context, thinking it would help navigation. This not only cost me $250 in evaluation runs but actually hurt performance. Excluding the file tree completely led to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2-3x performance boost in SWE bench pass rate&lt;/li&gt;
&lt;li&gt;10x reduction in costs (from ~$30 to ~$3 for 50 instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  4. Tool Names Matter
&lt;/h3&gt;

&lt;p&gt;Something as simple as renaming a tool from &lt;code&gt;replace_in_file&lt;/code&gt; to &lt;code&gt;edit_file&lt;/code&gt; decreased empty patch files by 22% (from 77% to 60.3%). The agent also complained less about not having the right tools to solve problems. I suspect there's a lot of opportunity for improvement here: finding tool names and interfaces that LLMs find easy to use.&lt;/p&gt;

&lt;h2&gt;
  Current Limitations
&lt;/h2&gt;

&lt;p&gt;To be honest, at 5% on SWE-bench Lite this agent isn't ready for production use. It's still far behind agents using Claude (in the 30-50% range), and it's still behind the top open-source performers on SWE-bench Lite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IBM AI Agent SWE-1.0: 23.67%&lt;/li&gt;
&lt;li&gt;Moatless tree search (Llama 3.1 70B): 17.7%&lt;/li&gt;
&lt;li&gt;OpenHands CodeAct (Llama 3.1 405B): 14%&lt;/li&gt;
&lt;li&gt;OpenHands CodeAct (Llama 3.1 70B): 9%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I'm optimistic: there's plenty of room for improvement, particularly around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High rate of empty patch files (59%)&lt;/li&gt;
&lt;li&gt;Tool call errors&lt;/li&gt;
&lt;li&gt;Unproductive looping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I'm keen to see how far I can push this agent.&lt;/p&gt;

&lt;h2&gt;
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Those were the key learnings from this project; I hope you took something away from it. The code is open source at &lt;a href="https://github.com/aidando73/l2-llama" rel="noopener noreferrer"&gt;github.com/aidando73/l2-llama&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feel free to comment below or message me on &lt;a href="https://x.com/aidando73" rel="noopener noreferrer"&gt;x.com&lt;/a&gt; if you have any feedback or questions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
