<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sandy Shen</title>
    <description>The latest articles on DEV Community by Sandy Shen (@sandysdn).</description>
    <link>https://dev.to/sandysdn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854317%2Fcf018690-9a83-475c-8c96-538713d2b401.png</url>
      <title>DEV Community: Sandy Shen</title>
      <link>https://dev.to/sandysdn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sandysdn"/>
    <language>en</language>
    <item>
      <title>Reducing bootstrap memory cost in LLM agents</title>
      <dc:creator>Sandy Shen</dc:creator>
      <pubDate>Wed, 01 Apr 2026 17:22:25 +0000</pubDate>
      <link>https://dev.to/sandysdn/reducing-bootstrap-memory-cost-in-llm-agents-2i7g</link>
      <guid>https://dev.to/sandysdn/reducing-bootstrap-memory-cost-in-llm-agents-2i7g</guid>
      <description>&lt;p&gt;LLM agents are stateless by default. To get continuity, the standard approach is to load everything into the system prompt. Logs, past decisions, project state.&lt;/p&gt;

&lt;p&gt;It works, but it is wasteful. We were spending 3,500+ tokens on memory before the agent even started doing anything useful. Load nothing, though, and you get the opposite problem: the agent forgets preferences and repeats the same mistakes every session.&lt;/p&gt;

&lt;p&gt;We stopped trying to tune the context window and changed how memory is handled.&lt;/p&gt;

&lt;p&gt;Instead of loading everything at once, we split memory into three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hot&lt;/strong&gt;: a small set of curated facts that is always loaded, around 625 tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warm&lt;/strong&gt;: recent logs from the last 7 days, pulled in only when needed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold&lt;/strong&gt;: older history stored externally and not loaded by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the time, the agent only needs one or two specific pieces of context.&lt;/p&gt;
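
&lt;p&gt;A minimal sketch of the idea in Python. The class, field names, and figures are illustrative assumptions, not the actual OpenClaw Auto Memory Manager API:&lt;/p&gt;

```python
import json
import time
from operator import ge  # ge(a, b) is a greater-or-equal check; keeps angle brackets out of this feed body

# Sketch of a three-tier memory loader (hypothetical names and thresholds).
# Cold tier (older history) stays in external storage and is never auto-loaded.
WARM_WINDOW_SECS = 7 * 24 * 3600  # "warm" covers the last 7 days

class TieredMemory:
    def __init__(self, hot_facts, log_entries):
        self.hot = hot_facts     # small curated dict, always loaded (~625 tokens)
        self.logs = log_entries  # list of {"ts": epoch seconds, "text": str}

    def bootstrap_context(self):
        # Only the hot tier is injected into the system prompt at startup.
        return json.dumps(self.hot)

    def warm_recall(self, keyword, now=None):
        # Warm tier: recent log lines fetched on demand, never preloaded.
        now = now if now is not None else time.time()
        cutoff = now - WARM_WINDOW_SECS
        return [e["text"] for e in self.logs
                if ge(e["ts"], cutoff) and keyword in e["text"]]

mem = TieredMemory(
    hot_facts={"user_prefers": "concise answers", "project": "zflow"},
    log_entries=[{"ts": time.time() - 3600, "text": "chose sqlite for cache"}],
)
print(mem.bootstrap_context())   # tiny bootstrap payload
print(mem.warm_recall("sqlite")) # ['chose sqlite for cache']
```

&lt;p&gt;The point is that the bootstrap payload stays tiny no matter how much history accumulates; everything else is fetched on demand.&lt;/p&gt;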

&lt;p&gt;That simple change made a big difference.&lt;br&gt;
&lt;strong&gt;In our setup, bootstrap memory cost dropped from around 3,500 tokens to about 125 tokens, roughly a 96% reduction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are preparing the open-source release of the OpenClaw Auto Memory Manager. Full write-up here:&lt;br&gt;
&lt;a href="https://zflow.ai/zflow_ai_insights_article_4.html" rel="noopener noreferrer"&gt;https://zflow.ai/zflow_ai_insights_article_4.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Stop tuning LLM agents with live API calls: A simulation-based approach</title>
      <dc:creator>Sandy Shen</dc:creator>
      <pubDate>Wed, 01 Apr 2026 00:51:15 +0000</pubDate>
      <link>https://dev.to/sandysdn/stop-tuning-llm-agents-with-live-api-calls-a-simulation-based-approach-1oe6</link>
      <guid>https://dev.to/sandysdn/stop-tuning-llm-agents-with-live-api-calls-a-simulation-based-approach-1oe6</guid>
      <description>&lt;p&gt;LLM agent configuration is a surprisingly large search space, including model choice, thinking depth, timeout, and context window. Most teams pick a setup once and never revisit it. Manual tuning with live API calls is slow and expensive, and usually only happens after something breaks.&lt;/p&gt;

&lt;p&gt;We explored a different approach: simulate first, then deploy. Instead of calling the model for every trial, we built a lightweight parametric simulator and replayed hundreds of configuration variants offline. A scoring function selects the lowest-cost configuration that still meets quality requirements.&lt;/p&gt;
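
&lt;p&gt;Here is a toy version of that loop. The cost/quality model below is a made-up parametric stand-in, not the real OpenClaw Auto-Tuner simulator, and every coefficient is invented for illustration:&lt;/p&gt;

```python
from itertools import product
from operator import ge  # ge(a, b) is a greater-or-equal check; keeps angle brackets out of this feed body

MIN_QUALITY = 0.85  # quality bar a configuration must clear

def simulate(model, thinking_depth, context_tokens):
    # Cheap parametric proxy: deeper thinking and bigger context raise both
    # quality and token cost. No API call is made for any trial.
    quality = min(1.0, 0.6 + 0.05 * thinking_depth + 0.00002 * context_tokens)
    cost = {"small": 1, "large": 4}[model] * (context_tokens + 200 * thinking_depth)
    return quality, cost

def search(models, depths, contexts):
    # Replay every configuration variant offline, then pick the cheapest
    # one that still meets the quality requirement.
    feasible = []
    for model, depth, ctx in product(models, depths, contexts):
        quality, cost = simulate(model, depth, ctx)
        if ge(quality, MIN_QUALITY):
            feasible.append((cost, model, depth, ctx))
    return min(feasible)  # lowest cost among feasible configs

best = search(["small", "large"], depths=[1, 2, 4, 8], contexts=[2000, 8000, 32000])
print(best)  # (3600, 'small', 8, 2000)
```

&lt;p&gt;Even with hundreds of variants, this is pure arithmetic, which is why the whole search finishes in seconds rather than the hours a live-API sweep would take.&lt;/p&gt;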

&lt;p&gt;The full search completes in under 5 seconds.&lt;/p&gt;

&lt;p&gt;A few patterns stood out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many agents are over-configured by default &lt;/li&gt;
&lt;li&gt;Token usage can often be reduced without impacting output quality &lt;/li&gt;
&lt;li&gt;Offline search is significantly faster than live experimentation &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In practice, this approach reduced token cost by around 20-40% on real workloads.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’re currently preparing the open-source release of the OpenClaw Auto-Tuner. If you’re interested, you can check the full write-up here: &lt;br&gt;
&lt;a href="https://zflow.ai/zflow_ai_insights_article_3.html" rel="noopener noreferrer"&gt;https://zflow.ai/zflow_ai_insights_article_3.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
