<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emmanuel Ekunsumi</title>
    <description>The latest articles on DEV Community by Emmanuel Ekunsumi (@tokoscope).</description>
    <link>https://dev.to/tokoscope</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007954%2Fa12c85dc-2158-4a81-8cdf-5e551b7ddb4c.jpeg</url>
      <title>DEV Community: Emmanuel Ekunsumi</title>
      <link>https://dev.to/tokoscope</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tokoscope"/>
    <language>en</language>
    <item>
      <title>How I Cut LLM API Costs by 60% With 2 Lines of Code</title>
      <dc:creator>Emmanuel Ekunsumi</dc:creator>
      <pubDate>Mon, 29 Jun 2026 11:00:09 +0000</pubDate>
      <link>https://dev.to/tokoscope/how-i-cut-llm-api-costs-by-60-with-2-lines-of-code-li2</link>
      <guid>https://dev.to/tokoscope/how-i-cut-llm-api-costs-by-60-with-2-lines-of-code-li2</guid>
      <description>&lt;p&gt;Our OpenAI bill tripled in 60 days.&lt;/p&gt;

&lt;p&gt;User growth was up 40%. Revenue was up. But the API bill was growing 3x faster than everything else.&lt;/p&gt;

&lt;p&gt;I spent a week digging into why. What I found was embarrassing and completely fixable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data showed
&lt;/h2&gt;

&lt;p&gt;After analyzing thousands of real API calls, the same four patterns kept showing up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Bloated system prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most system prompts accumulate over time. Teams add instructions for edge cases, add clarifications, add reminders and never remove anything. The result: system prompts that say the same thing four different ways.&lt;/p&gt;

&lt;p&gt;Here's a real example before and after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Before - 89 tokens&lt;/span&gt;
You are a helpful customer support assistant. Please make sure to always
be polite and professional in your responses. It is very important that
you respond to customer questions in a helpful manner. Make sure to note
that you should always try to resolve the customer's issue. Please be
concise but also make sure to be thorough. Always maintain a professional
tone and make sure to be empathetic to the customer's situation.

&lt;span class="gh"&gt;# After — 18 tokens&lt;/span&gt;
You are a polite, professional customer support assistant.
Resolve issues concisely and empathetically.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same model behavior. 80% fewer tokens on every single call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No semantic caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exact match caching catches identical prompts. But users don't ask the same question the same way twice.&lt;/p&gt;

&lt;p&gt;"How do I reset my password?" and "I forgot my password, what do I do?" should return the same cached response. Without semantic caching, both hit the API and cost tokens.&lt;/p&gt;

&lt;p&gt;We were running a customer support bot with hundreds of near-duplicate requests every day. Every one was hitting the API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Stuffed context windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We defaulted to sending full conversation history on every call. Turns out only the last 3-4 turns actually influenced the output. We were paying for 20 turns of context the model was mostly ignoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Zero visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had no idea which feature was burning the most tokens. No breakdown by endpoint. No cost per feature. Just a monthly invoice.&lt;/p&gt;

&lt;p&gt;Turns out our onboarding flow was costing 10x more per user than our core product and we'd never thought to optimize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: two lines of code
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://tokoscope.com" rel="noopener noreferrer"&gt;Tokoscope&lt;/a&gt; to solve this. It wraps your existing LLM client and handles everything automatically:&lt;/p&gt;

&lt;h3&gt;
  
  
  JavaScript
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;wrap&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tokoscope&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// Before&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// After — that's it&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ts_live_...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;// from app.tokoscope.com/settings&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// All your existing calls work unchanged&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tokoscope&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wrap&lt;/span&gt;

&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ts_live_...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# All your existing calls work unchanged
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with Anthropic and Gemini too, same pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens automatically
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt compression
&lt;/h3&gt;

&lt;p&gt;Every prompt gets scored for waste. High-waste prompts are automatically rewritten to their minimum effective form using Claude Haiku before being tracked.&lt;/p&gt;

&lt;p&gt;Real result from our own testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Original — 113 tokens&lt;/span&gt;
Please note that it is very important that you make sure to respond to
my question. As an AI, I want you to please make sure that you
understand that I need you to help me. Make sure to note that what I am
asking you is the following question which is important:
What is the capital of France? Please make sure to answer clearly.

&lt;span class="gh"&gt;# Compressed — 8 tokens&lt;/span&gt;
What is the capital of France? Answer concisely.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;90% token reduction. Same answer from the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic caching
&lt;/h3&gt;

&lt;p&gt;Two-layer caching system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Exact match:&lt;/strong&gt; Hash the prompt. Identical requests return cached responses instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Semantic match:&lt;/strong&gt; Generate an embedding for the prompt. Compare cosine similarity against cached embeddings. At 85%+ similarity, return the cached response.&lt;/p&gt;

&lt;p&gt;Console output when it fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⚡ Tokoscope cache hit [semantic (89.3% match)] — saved 93 tokens ($0.000049)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"What is the population of Japan?" and "How many people live in Japan?" scored 89.3% similarity and correctly served the cached response. Saved a full API call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost attribution
&lt;/h3&gt;

&lt;p&gt;Every call is logged with token counts, cost, waste score, endpoint, and user ID. The dashboard breaks it down so you can see exactly which feature costs the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Three techniques, compounding:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Typical saving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt compression&lt;/td&gt;
&lt;td&gt;All calls&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact match caching&lt;/td&gt;
&lt;td&gt;Identical prompts&lt;/td&gt;
&lt;td&gt;~10–15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Similar prompts&lt;/td&gt;
&lt;td&gt;~20–25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60–70%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These compound because they operate at different layers. Compression reduces the size of every call. Caching eliminates calls entirely. Attribution tells you where to focus first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-user tracking
&lt;/h2&gt;

&lt;p&gt;If you're building a multi-tenant app, you can pass a user ID to see token usage per end user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// JavaScript&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ts_live_...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Python&lt;/span&gt;
&lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ts_live_...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users show up in the dashboard with individual token usage, cost, and waste scores.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;tokoscope
&lt;span class="c"&gt;# or&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tokoscope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free tier monitors up to 500K tokens per month. No credit card required.&lt;/p&gt;

&lt;p&gt;The dashboard is at &lt;a href="https://app.tokoscope.com" rel="noopener noreferrer"&gt;app.tokoscope.com&lt;/a&gt; and you can have it running in under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;The biggest surprise wasn't how much waste there was, it was how invisible it was. Without token-level visibility, you're flying blind. You can't optimize what you can't measure.&lt;/p&gt;

&lt;p&gt;The good news: once you can see it, the fixes are usually simple. Compress the system prompt. Add semantic caching. Trim the context window. Three changes, compounding savings.&lt;/p&gt;

&lt;p&gt;The bad news: most teams find out about token waste the same way we did — when the bill arrives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're building with LLMs and your API costs keep growing — try Tokoscope. It's free to start and takes 2 minutes to integrate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Feedback welcome in the comments — especially from anyone who's tried different semantic similarity thresholds or embedding models for caching.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>openai</category>
      <category>python</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
