<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Conrad Bogusz</title>
    <description>The latest articles on DEV Community by Conrad Bogusz (@conradbogusz).</description>
    <link>https://dev.to/conradbogusz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3397292%2F79419ff7-eed3-4842-b326-a764c89f149d.jpg</url>
      <title>DEV Community: Conrad Bogusz</title>
      <link>https://dev.to/conradbogusz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/conradbogusz"/>
    <language>en</language>
    <item>
      <title>Prompt caching vs. stateful approach</title>
      <dc:creator>Conrad Bogusz</dc:creator>
      <pubDate>Mon, 04 Aug 2025 11:09:50 +0000</pubDate>
      <link>https://dev.to/conradbogusz/prompt-caching-vs-statefull-approach-22h1</link>
      <guid>https://dev.to/conradbogusz/prompt-caching-vs-statefull-approach-22h1</guid>
      <description>&lt;p&gt;We still haven’t figured out how to optimize binge-watching all 38 seasons of The Simpsons 🍿. Luckily, there’s no real need to, but in the world of LLMs, it’s a whole different story.&lt;/p&gt;

&lt;p&gt;Here’s how some providers handle prompt caching:&lt;/p&gt;

&lt;p&gt;✅ OpenAI: Auto-caches the longest matching prefix (after 1,024 tokens, then in 128-token chunks). No config needed; up to 80% lower latency and 50% lower input costs.&lt;br&gt;
⚙️ Anthropic: Manual caching via headers (&lt;code&gt;anthropic-beta: prompt-caching-YYYY-MM-DD&lt;/code&gt; plus &lt;code&gt;cache_control&lt;/code&gt;). Works only for exact prefix matches. Reads save ~90% of cost; cache writes add ~25%.&lt;br&gt;
🔧 AWS Bedrock: Opt-in with &lt;code&gt;EnablePromptCaching=true&lt;/code&gt;, TTL of 5 minutes. Saves up to 90% on input costs and 85% on latency.&lt;br&gt;
📦 Google Vertex: Manual &lt;code&gt;CachedContent&lt;/code&gt;, billed by token-hours, up to 75% discount on reads, TTL up to 1 hour. More complex to manage.&lt;/p&gt;
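
&lt;p&gt;As a concrete illustration, here is a minimal Python sketch of the Anthropic flavor: building a Messages API request with a cacheable system prefix. Model name and beta-header date are illustrative assumptions; check Anthropic's docs for current values.&lt;/p&gt;

```python
# Sketch: an Anthropic Messages API request with prompt caching enabled.
# The model name and beta-header date below are illustrative assumptions.

def build_cached_request(api_key, system_prompt, user_message):
    """Build headers and payload for a prompt-cached Anthropic request."""
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        # Opt in to the prompt-caching beta (date suffix may change).
        "anthropic-beta": "prompt-caching-2024-07-31",
        "content-type": "application/json",
    }
    payload = {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        # Mark the long, stable prefix as cacheable; later calls are only
        # served from cache when the prefix matches exactly.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
    return headers, payload

# To send: requests.post("https://api.anthropic.com/v1/messages",
#                        headers=headers, json=payload)
```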

&lt;p&gt;If you are interested in a different approach — a STATEFUL API for AI model inference — give it a try at ark-labs.cloud 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>api</category>
    </item>
    <item>
      <title>Meet ARKLABS API: Stateful AI Inference you've never heard of</title>
      <dc:creator>Conrad Bogusz</dc:creator>
      <pubDate>Thu, 31 Jul 2025 10:14:43 +0000</pubDate>
      <link>https://dev.to/arklabscloud/meet-arklabs-api-stateful-ai-inference-you-never-heard-about-32mb</link>
      <guid>https://dev.to/arklabscloud/meet-arklabs-api-stateful-ai-inference-you-never-heard-about-32mb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; ARK Cloud API launches today with stateful AI inference (&lt;strong&gt;almost free input tokens&lt;/strong&gt;), signup &amp;amp; model inference in under 10 s with Google SSO (no credit card needed), and &lt;strong&gt;up to 71% cost savings&lt;/strong&gt; on Stable Diffusion 3.5 Large inference — all running on &lt;strong&gt;100% EU‑based infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why Stateful AI Matters&lt;/h2&gt;

&lt;p&gt;Most AI APIs are &lt;strong&gt;stateless&lt;/strong&gt;—meaning you resend the same context over and over, burning budget and GPU cycles. As inference demand skyrockets, this inefficiency becomes a bottleneck. Enter &lt;strong&gt;stateful inference&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Almost-Zero-Cost Input Tokens&lt;/strong&gt;
Context persists across calls, so you never overpay for tokens you’ve already sent.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized GPU Utilization&lt;/strong&gt;
Less recompute = more throughput on the same hardware.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built ARK Cloud API to fix that. Our &lt;strong&gt;stateful mode&lt;/strong&gt; “remembers” your context, so &lt;strong&gt;input tokens cost next to nothing&lt;/strong&gt;. That means richer, longer conversations and far more efficient GPU use.&lt;/p&gt;
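
&lt;p&gt;A back-of-the-envelope sketch of why this matters: with a stateless API the whole history is resent every turn, so billed input tokens grow quadratically with conversation length; with a stateful API each token is sent once. The numbers are illustrative.&lt;/p&gt;

```python
# Illustrative comparison of billed input tokens over a multi-turn chat.

def stateless_input_tokens(turns, tokens_per_turn):
    """Total input tokens when the full history is resent every call."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn   # this turn's new tokens
        total += history             # the entire history is billed as input
    return total

def stateful_input_tokens(turns, tokens_per_turn):
    """Total input tokens when the server keeps the context."""
    return turns * tokens_per_turn   # each token is sent (and billed) once

# 20 turns of ~500 tokens each:
print(stateless_input_tokens(20, 500))  # 105000
print(stateful_input_tokens(20, 500))   # 10000
```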

&lt;h2&gt;Key Features&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🚀 &lt;strong&gt;10-Second Onboarding&lt;/strong&gt;
Google SSO → Dashboard → API Key. Blink, and you’re running inference.
&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;50 000 Free Credits&lt;/strong&gt;
No credit card required. Fuel LLMs, STT, embeddings, and Stable Diffusion.
&lt;/li&gt;
&lt;li&gt;🔀 &lt;strong&gt;OpenAI-Compatible API&lt;/strong&gt;
Swap endpoints, keep your existing code.
&lt;/li&gt;
&lt;li&gt;🇪🇺 &lt;strong&gt;100% EU Infrastructure&lt;/strong&gt;
GDPR-strong, no logs, no stored data.
&lt;/li&gt;
&lt;li&gt;💸 &lt;strong&gt;Pay-As-You-Go&lt;/strong&gt;
Only pay for output tokens and compute time.
&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Cheapest Stable Diffusion&lt;/strong&gt;
Best price on the market for Stable Diffusion&lt;/li&gt;
&lt;/ul&gt;
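
&lt;p&gt;Because the API is OpenAI-compatible, "swap endpoints, keep your code" boils down to changing the base URL. A minimal sketch below — the base URL is a placeholder assumption; the real one is shown in the ARK Cloud dashboard.&lt;/p&gt;

```python
# Sketch: pointing an existing OpenAI-style client at an OpenAI-compatible
# endpoint. Only the base URL changes; the request shape stays the same.

def build_chat_request(base_url, api_key, model, prompt):
    """Assemble an OpenAI-style chat completions request."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": "Bearer " + api_key,
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload

# Placeholder base URL (assumption, not the documented endpoint):
# url, headers, payload = build_chat_request(
#     "https://api.ark-labs.cloud/v1", KEY, "some-model", "Hello")
# requests.post(url, headers=headers, json=payload)
```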

&lt;p&gt;Getting started at ark-labs.cloud:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in with Google (⏱️ 10s)&lt;/li&gt;
&lt;li&gt;Claim your 50 000 free credits&lt;/li&gt;
&lt;li&gt;Point your existing calls at the ARK Cloud API&lt;/li&gt;
&lt;li&gt;Scale with confidence—no hidden fees, total privacy&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
