<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ken Ahrens</title>
    <description>The latest articles on DEV Community by Ken Ahrens (@kenahrens).</description>
    <link>https://dev.to/kenahrens</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F698681%2F04087e7d-f91d-47fe-8981-105a2c24f8ba.jpeg</url>
      <title>DEV Community: Ken Ahrens</title>
      <link>https://dev.to/kenahrens</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kenahrens"/>
    <language>en</language>
    <item>
      <title>How to Tame Your AI Agents: From $900 in 18 Days to Coding Smarter</title>
      <dc:creator>Ken Ahrens</dc:creator>
      <pubDate>Tue, 12 Aug 2025 23:23:53 +0000</pubDate>
      <link>https://dev.to/kenahrens/how-to-tame-your-ai-agents-from-900-in-18-days-to-coding-smarter-75n</link>
      <guid>https://dev.to/kenahrens/how-to-tame-your-ai-agents-from-900-in-18-days-to-coding-smarter-75n</guid>
      <description>&lt;p&gt;It started with a curiosity and ended with a $900 bill. Eighteen days. Three AI coding agents: Claude Code, Gemini CLI, Cursor and Codex. What could possibly go wrong? Turns out, everything—until I learned how to tame them.&lt;/p&gt;

&lt;p&gt;When I first fired up Cursor back in March, it was like having a hyperactive coding partner who never needed coffee breaks. I used it to freshen up &lt;a href="https://docs.speedscale.com/" rel="noopener noreferrer"&gt;product docs&lt;/a&gt; and tweak a few demo apps. Then Claude Code hit the scene in June and I dove headfirst into something more ambitious: vibecoding a complete &lt;a href="https://github.com/kenahrens/crm-demo" rel="noopener noreferrer"&gt;CRM demo app&lt;/a&gt; (React frontend, Go backend, Postgres database). That worked so well, I figured—why not push it further?&lt;/p&gt;

&lt;p&gt;Gemini CLI arrived just in time for me to test it on an even bigger challenge: building a &lt;a href="https://github.com/speedscale/microsvc" rel="noopener noreferrer"&gt;banking microservice application&lt;/a&gt; with full OpenTelemetry tracing. Since we use Google Workspace, working with Gemini AI Agent seemed like a no-brainer. But where Claude kept pace and Cursor quickly showed off code changes, Gemini sometimes got lost in its own loops—one particularly wild day ended with it racking up $300 in charges all by itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmxse4udow7wosyquscl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmxse4udow7wosyquscl.png" alt="Gemini AI agent bill showing $300 in charges from runaway loops" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the end of July, I’d also migrated our marketing site from WordPress to an Astro content site, and GPT-5 Codex had entered the chat. I had four AI development tools at my fingertips and an itch to see how far I could take them. In less than three weeks, I burned through $900 in API costs and monthly subscription fees (about $50 per day of #vibecoding).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmge8x9h64c9g1yexte4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmge8x9h64c9g1yexte4.png" alt="Claude Code API bill showing $300 in charges in just a few days" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Costly Lessons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Don't Let the AI Drive
&lt;/h3&gt;

&lt;p&gt;The biggest mistake I made early on was treating AI agents like senior developers who could just "figure it out." I'd give them vague instructions like "build a microservices app" and watch them spiral into increasingly complex solutions that solved problems I didn't have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3huxng6jbbyz9dhl5z5t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3huxng6jbbyz9dhl5z5t.jpg" alt="AI Agents Drive Safely" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI agents work best when managed like talented junior engineers: give them clear requirements, specific constraints, and well-defined deliverables. Create a PLAN.md that breaks down exactly what you want, in what order, with clear boundaries. Then supervise each step before letting them move to the next one. This is a great primer from Rich Stone on how to &lt;a href="https://richstone.io/1-4-code-with-llms-and-a-plan/" rel="noopener noreferrer"&gt;Code with LLMs and a Plan&lt;/a&gt;.&lt;/p&gt;
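&lt;p&gt;As a sketch, a PLAN.md along these lines keeps the agent on rails (the task names and constraints here are hypothetical, not the actual plan from the banking project):&lt;/p&gt;

```markdown
# PLAN.md: banking microservice demo (illustrative example)

## Constraints
- Go backends, React frontend, Postgres database
- No new libraries without updating ARCHITECTURE.md first

## Tasks (one branch and one check-in each)
- [x] 1. Scaffold accounts-service with a health endpoint
- [ ] 2. Add the Postgres schema and migrations
- [ ] 3. Wire up OpenTelemetry tracing through the Collector

## Today's focus
Task 2 only. Do not start task 3.
```

&lt;p&gt;The "Today's focus" section is the supervision step: the agent gets one task at a time, not the whole map at once.&lt;/p&gt;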

&lt;p&gt;Think of it as technical leadership, not delegation. You're the architect; they're the implementers. If you learn something new about your architecture while working a task from the list, tell the AI agent to note it in &lt;code&gt;ARCHITECTURE.md&lt;/code&gt; so the standard is captured. Left to itself, the agent will drift from those standards, so you may need to remind it frequently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Docker Identity Crisis
&lt;/h3&gt;

&lt;p&gt;Another painful headache came from letting an AI mix Docker Compose (for local) and Kubernetes (for production) configs without clear boundaries. One minute it’s spinning up a clean &lt;code&gt;docker-compose.yml&lt;/code&gt; for local dev, the next it’s sprinkling Kubernetes &lt;code&gt;Deployment&lt;/code&gt; YAML into the mix, resulting in setups that ran nowhere. And when I asked it to test something, it would run part in Docker and part in Kubernetes and easily confuse itself.&lt;/p&gt;

&lt;p&gt;The fix? Separate everything. I now keep local and production infra in completely different directories and make it painfully clear to the AI which world we’re in before it writes a single line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── kubernetes
│   ├── base
│   │   ├── configmaps
│   │   │   ├── app-config.yaml
│   │   │   └── app-secrets.yaml
│   │   ├── database
│   │   │   ├── postgres-configmap.yaml
│   │   │   ├── postgres-deployment.yaml
│   │   │   ├── postgres-pvc.yaml
│   │   │   └── postgres-service.yaml
│   │   ├── deployments
│   │   │   ├── accounts-service-deployment.yaml
│   │   │   ├── api-gateway-deployment.yaml
│   │   │   ├── frontend-deployment.yaml
│   │   │   ├── transactions-service-deployment.yaml
│   │   │   └── user-service-deployment.yaml
│   │   ├── ingress
│   │   │   ├── frontend-ingress-alternative.yaml
│   │   │   └── frontend-ingress.yaml
│   │   ├── kustomization.yaml
│   │   ├── namespace
│   │   │   └── namespace.yaml
│   │   └── services
│   │       ├── accounts-service-service.yaml
│   │       ├── api-gateway-service.yaml
│   │       ├── frontend-service-nodeport.yaml
│   │       ├── frontend-service.yaml
│   │       ├── transactions-service-service.yaml
│   │       └── user-service-service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
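&lt;p&gt;The local side lives in its own directory with nothing but Compose in it. A minimal sketch (service names and images here are illustrative, not the project's actual config):&lt;/p&gt;

```yaml
# docker/docker-compose.yml -- local dev only; no Kubernetes manifests ever go here
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: localdev   # throwaway credential, local use only
    ports:
      - "5432:5432"
  api-gateway:
    build: ../api-gateway
    ports:
      - "8080:8080"
    depends_on:
      - postgres
```

&lt;p&gt;With the two worlds physically separated, "we are working in &lt;code&gt;docker/&lt;/code&gt; today" is an unambiguous instruction the AI can follow.&lt;/p&gt;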



&lt;h3&gt;
  
  
  OpenTelemetry Overload
&lt;/h3&gt;

&lt;p&gt;Then came observability. I trusted the AI to set up tracing across Node.js and Spring Boot services. Big mistake. It pulled in deprecated Node OTel APIs, tried to auto- and manually instrument Spring Boot at the same time (hello, duplicate spans), and wrote Jaeger configs that didn’t match my collector.&lt;/p&gt;

&lt;p&gt;Now I predefine &lt;em&gt;exactly&lt;/em&gt; which observability stack I’m using—library names, versions, and all—and paste that into every session so the AI can’t go rogue. If you're not sure, ask the AI to audit what it installed and double-check whether those are the right versions and configs. In my case it realized it had the wrong configs for Jaeger and recommended installing the OTel Collector, which cleaned up the config quite a bit.&lt;/p&gt;
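&lt;p&gt;The Collector-based setup ends up pleasantly small. A minimal sketch of the config (endpoints and service names are illustrative, not the project's actual values): apps export OTLP to the Collector, which forwards to Jaeger, which has accepted OTLP natively since v1.35.&lt;/p&gt;

```yaml
# otel-collector-config.yaml -- one pipeline: apps -> Collector -> Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # Jaeger's native OTLP gRPC port
    tls:
      insecure: true        # fine for a local demo, not for production
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```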

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzp0eo9zwb6509foydn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzp0eo9zwb6509foydn1.png" alt="OTEL Architecture after better planning" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The 1.8GB Node.js Docker Image
&lt;/h3&gt;

&lt;p&gt;This one was a shocker. Here's what the AI generated for our Next.js frontend—a classic case of "it works" without any thought about efficiency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# What the AI built (simplified version)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:20&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;  &lt;span class="c"&gt;# Installs ALL dependencies, including dev ones&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm run build
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 3000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["npm", "start"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This innocent-looking Dockerfile created a &lt;strong&gt;1.8GB monster&lt;/strong&gt;. The base Node 20 image alone is 1.1GB, then it installed all dev dependencies (including things like TypeScript, ESLint, and testing frameworks that shouldn't be in production), copied the entire source tree, and kept everything.&lt;/p&gt;

&lt;p&gt;I only realized how bad it was when a user casually mentioned, "Your images take forever to start." Sure enough, the startup lag was brutal. The AI had made no attempt to slim things down because I hadn't told it to.&lt;/p&gt;

&lt;p&gt;The fix required explicit instructions about multi-stage builds and production optimization—resulting in a &lt;a href="https://github.com/speedscale/microsvc/commit/optimize-images" rel="noopener noreferrer"&gt;97% size reduction from 1.8GB to ~50MB&lt;/a&gt;. If you don't explicitly demand lean builds, it won't even try.&lt;/p&gt;
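&lt;p&gt;A multi-stage Dockerfile along these lines is what gets the size down (this is a generic Next.js sketch under stated assumptions, not the exact Dockerfile from the repo):&lt;/p&gt;

```dockerfile
# Stage 1: build with the full toolchain (dev dependencies allowed here)
FROM node:20 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: ship only the production runtime on a slim base
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev              # production dependencies only
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/public ./public
EXPOSE 3000
CMD ["npm", "start"]
```

&lt;p&gt;Getting all the way down toward ~50MB additionally relies on Next.js standalone output mode, which copies only the server files it needs instead of &lt;code&gt;node_modules&lt;/code&gt;.&lt;/p&gt;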

&lt;h2&gt;
  
  
  The Wins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. PLAN.md as a North Star&lt;/strong&gt; – Writing a detailed PLAN.md with every service, API, and today's focus point keeps the AI grounded. Hallucinations dropped by about 80% once I started using this. It's the one file that gives the AI its "map" before it starts building. Checking items off the plan also gives you that sense of incremental progress, like something is actually getting done around here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-Agent Workflow&lt;/strong&gt; – Sometimes one agent just isn't enough. Rather than relying on a single AI that might have blind spots, I started configuring Claude to "call out" to specialized sub-agents for second opinions—like having a Gemini agent act as fact-checker or a critical thinking agent provide analytical feedback. Each sub-agent gets a clean context window and specialized tooling for their specific role. This approach delivered measurably better results: some published evaluations report up to 90% improvement over standalone agents on complex tasks. You're essentially building a specialized team where each AI has a focused expertise rather than asking "a chef to fix a car engine." My friend Shaun wrote more about this approach in &lt;a href="https://proxymock.io/blog/is-your-agent-lying/" rel="noopener noreferrer"&gt;Is Your Agent Lying?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3a1acxiiepbzixujlk4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3a1acxiiepbzixujlk4.jpg" alt="Multi-Agent Workflow In Practice" width="544" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The "Prove It" Step&lt;/strong&gt; – This is where I make the AI prove it tested its own work. Good is having it run a quick self-check and explain what it tested. Better is TDD—writing the tests first, then building to make them pass. Best is when those tests run automatically in CI with hooks that block anything failing from merging. This one change has caught more silly errors than I'd like to admit.&lt;/p&gt;
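&lt;p&gt;The "best" tier, blocking failing work from merging, can be as small as one CI job marked as a required status check. A generic GitHub Actions sketch (your test command and branch protection settings will differ):&lt;/p&gt;

```yaml
# .github/workflows/ci.yml -- mark the "test" job as a required check on main
name: ci
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test   # the AI's "prove it" step, enforced on every PR
```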

&lt;p&gt;&lt;strong&gt;4. Real Traffic Testing with ProxyMock&lt;/strong&gt; – Unit tests are great, but they don't catch integration failures or API contract changes. I started using &lt;a href="https://proxymock.io" rel="noopener noreferrer"&gt;proxymock&lt;/a&gt; to record real production traffic patterns, then replay them against new versions of services. This caught several breaking changes that would have slipped through traditional testing—like when the AI "optimized" a JSON response structure without realizing downstream services depended on the original format. Recording actual traffic patterns and replaying them against every code change became the ultimate safety net for AI-generated modifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LATENCY / THROUGHPUT
+--------------------+--------+-------+-------+-------+-------+-------+-------+-------+------------+
|      ENDPOINT      | METHOD |  AVG  |  P50  |  P90  |  P95  |  P99  | COUNT |  PCT  | PER-SECOND |
+--------------------+--------+-------+-------+-------+-------+-------+-------+-------+------------+
| /                  | GET    |  1.00 |  1.00 |  1.00 |  1.00 |  1.00 |     1 | 20.0% |      18.56 |
| /api/numbers       | GET    |  4.00 |  4.00 |  4.00 |  4.00 |  4.00 |     1 | 20.0% |      18.56 |
| /api/rocket        | GET    |  4.00 |  4.00 |  4.00 |  4.00 |  4.00 |     1 | 20.0% |      18.56 |
| /api/rockets       | GET    |  4.00 |  5.00 |  5.00 |  5.00 |  5.00 |     1 | 20.0% |      18.56 |
| /api/latest-launch | GET    | 34.00 | 34.99 | 34.99 | 34.99 | 34.99 |     1 | 20.0% |      18.56 |
+--------------------+--------+-------+-------+-------+-------+-------+-------+-------+------------+

1 PASSED CHECKS
 - check "requests.response-pct != 100.00" was not violated - observed requests.response-pct was 100.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;As a startup co-founder, my world isn’t measured in billable hours—it’s measured in how quickly we can get something in people’s hands, learn from it, and ship the next iteration. The banking demo wasn’t just an experiment; it was a race against the clock to have something ready for KubeCon India.&lt;/p&gt;

&lt;p&gt;We made it. The team presented the project on stage, showing off our “Containerized Time Travel” with traffic replay. It was the perfect proof point that speed and iteration matter more than perfection in the early days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd09mu608wkbtum99v83u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd09mu608wkbtum99v83u.jpeg" alt="Pega team presenting at KubeCon India 2025" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can watch their talk here: &lt;a href="https://kccncind2025.sched.com/event/23Ev9/containerized-time-travel-replicating-production-performance-sravanthi-naga-hari-babu-volli-pegasystems?iframe=no" rel="noopener noreferrer"&gt;Containerized Time Travel with Traffic Replay – KubeCon India&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Agent Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;When your AI agent starts spinning its wheels or burning through tokens, stop and check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context overload&lt;/strong&gt;: Is the conversation too long? Start fresh with a clear, focused prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vague requirements&lt;/strong&gt;: Did you give it a specific goal or just say "make it better"?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing constraints&lt;/strong&gt;: Have you defined boundaries (tech stack, file structure, performance requirements)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No success criteria&lt;/strong&gt;: How will the AI know when it's done?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool confusion&lt;/strong&gt;: Is it trying to use the wrong approach for the task (e.g., complex Kubernetes for a simple local dev setup)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite loops&lt;/strong&gt;: Is it repeatedly "fixing" the same issue? Stop and reframe the problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep&lt;/strong&gt;: Has it started solving problems you didn't ask it to solve?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When in doubt, restart with a PLAN.md that breaks down exactly what you want, then hand it one piece at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I'll Avoid Another $900 Sprint
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Choose a main model and go for their version of an "unlimited" plan. As of August 2025, for example, you can get Claude Max for $200 with high limits and no per-API costs.&lt;/li&gt;
&lt;li&gt;The web interfaces are good for building out a plan: have the model research and draft the initial plan, which you then hand over to the AI agent.&lt;/li&gt;
&lt;li&gt;Check the dependencies of your project. The AI tools readily add new libraries; keep them in line with &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;. An easy way to tell: when you check in code, see if your &lt;code&gt;pom.xml&lt;/code&gt;, &lt;code&gt;package.json&lt;/code&gt;, or &lt;code&gt;go.mod&lt;/code&gt; has new entries.&lt;/li&gt;
&lt;li&gt;Enforce small diffs. Have it make a branch and a separate check-in for each change. Then run "/clean" in between steps on your &lt;code&gt;PLAN.md&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
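&lt;p&gt;The small-diff discipline is plain git. A self-contained sketch of the loop I ask the agent to follow (the repo, branch, and file names are made up for illustration):&lt;/p&gt;

```shell
set -e
# Work in a scratch repo so the example stands alone
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial"

# One branch and one commit per PLAN.md task keeps every diff reviewable
git checkout -q -b task-01-accounts-schema
echo "CREATE TABLE accounts (id SERIAL PRIMARY KEY);" > schema.sql
git add schema.sql
git commit -q -m "task 01: add accounts schema"

# Inspect the blast radius of the change before merging back to main
git diff --stat main..HEAD
```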

&lt;h2&gt;
  
  
  Ready to Tame Your AI Agents?
&lt;/h2&gt;

&lt;p&gt;The journey from chaos to control with AI coding agents isn't about avoiding them—it's about learning to tame them. With the right approach, these tools can accelerate your development without draining your bank account.&lt;/p&gt;

&lt;p&gt;I'd love to hear your story. What's the most expensive lesson you've learned with AI coding agents? Share it—we might just build the ultimate survival guide together.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>coding</category>
    </item>
    <item>
      <title>Record API calls in prod, replay in dev to test</title>
      <dc:creator>Ken Ahrens</dc:creator>
      <pubDate>Sun, 28 Jul 2024 20:07:26 +0000</pubDate>
      <link>https://dev.to/kenahrens/record-api-calls-in-prod-replay-in-dev-to-test-3knd</link>
      <guid>https://dev.to/kenahrens/record-api-calls-in-prod-replay-in-dev-to-test-3knd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Have you ever experienced the problem where your code is broken in production, but everything runs correctly in your dev environment? This can be really challenging because you have limited information once something is in production, and you can’t easily make changes and try different code. Speedscale production data simulation lets you securely capture the production application traffic, normalize the data, and replay it directly in your dev environment.&lt;/p&gt;

&lt;p&gt;There are a lot of challenges with trying to replicate the production environment in non-prod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt; - Production has much more data and a much wider variety than non-prod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third Parties&lt;/strong&gt; - It’s not always possible to integrate non-prod with third party sandboxes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt; - The scale of non-prod environment is typically just a fraction of production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using production data simulation, you can bring the realistic data and scale from production back into the non-prod dev and staging environments. Like any good process, implementing Speedscale boils down to three simple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Record&lt;/strong&gt; - utilize the Speedscale sidecar to capture traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze&lt;/strong&gt; - identify the exact set of calls you want to replicate from prod into dev &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay&lt;/strong&gt; - utilize the Speedscale operator to run the traffic against your dev cluster&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;“Works on my machine” -Henry Ford (not a real quote)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Record
&lt;/h2&gt;

&lt;p&gt;In order to capture traffic from your production cluster, you’re going to want to install the operator (&lt;a href="https://github.com/speedscale/operator-helm" rel="noopener noreferrer"&gt;helm chart&lt;/a&gt; is usually the preferred method). During the installation, don’t forget to configure Data Loss Prevention (DLP) to identify sensitive fields you want to mask; a good example is the HTTP Authorization header. Configuring DLP is as easy as these settings in your &lt;code&gt;values.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Data Loss Prevention settings.&lt;/span&gt;
dlp:
    enabled: &lt;span class="nb"&gt;true
    &lt;/span&gt;config: &lt;span class="s2"&gt;"standard"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have the operator installed, then annotate the workload you’d like to record, for example if you have an nginx deployment, you can run something like this (or the GitOps equivalent if you prefer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate deployment nginx sidecar.speedscale.com/inject&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that your pod got the sidecar added; you should see an additional container.&lt;/p&gt;

&lt;p&gt;⚡ Note there are additional &lt;a href="https://docs.speedscale.com/setup/sidecar/sidecar-annotations/" rel="noopener noreferrer"&gt;configuration options&lt;/a&gt; as needed for more complex use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyze
&lt;/h2&gt;

&lt;p&gt;Now that you have the sidecar, you should see the service show up in Speedscale. At a glance you’re able to see how much traffic your service is handling and which real backend systems it relies on. For example our service needs data in DynamoDB and real connections to Stripe and Plaid to work. In a corporate dev environment this kind of access may not be properly configured. Fortunately with Speedscale, we will be able to replicate even these third-party APIs into our dev cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm25rywxrxi7994x2onda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm25rywxrxi7994x2onda.png" alt="API Service Map" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Drilling down further into the data you can see all the details of the calls, including the fact that the Authorization data has been redacted. There is a ton of data available, and it’s totally secure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq62dxe26ee6tzduyuzs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq62dxe26ee6tzduyuzs0.png" alt="API Transaction Details" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set the right time range for your data and add some filters to make sure you include just the traffic that you want to replay. Finally, hit the &lt;code&gt;Record&lt;/code&gt; button to complete the analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo6qjwabndlirwddgszm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo6qjwabndlirwddgszm.png" alt="API traffic filtering" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Replay
&lt;/h2&gt;

&lt;p&gt;Just like during the record step, you will want to make sure the Speedscale operator is installed in your dev cluster. You can use the same helm chart install as before, but remember to give your cluster a new name like &lt;code&gt;dev-cluster&lt;/code&gt; (or whatever your favorite name is).&lt;/p&gt;

&lt;p&gt;The wizard lets you pick and choose which ingress and egress services you want to replay in your dev cluster. This is how you’ll solve the problem of not having the right data in DynamoDB, and how to provide the Stripe and Plaid responses even if you don’t have them configured in the dev cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx1fa8js4vye2mlvmwlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx1fa8js4vye2mlvmwlq.png" alt="Traffic-based service mocks" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally you can take the traffic you’ve selected and replay it locally in your non-prod dev cluster. Speedscale takes care of normalizing the traffic and modifying the workload so that a full production simulation takes place. The code you have running will behave just the same way it does under production conditions because the same kinds of API traffic and data are being used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb45v7bufj1keimggh0yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb45v7bufj1keimggh0yz.png" alt="Destination cluster" width="534" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the traffic replay is complete, you’ll get a nice report showing how the traffic behaved in your dev cluster; you can even change configurations and easily replay the traffic again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidbfi3nnu10aaq9v5kpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidbfi3nnu10aaq9v5kpg.png" alt="Traffic replay results" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have the ability to replay this traffic in any environment where you need it: development clusters, CI/CD systems, staging or user acceptance environments. This lets you re-create production conditions, run experiments, validate code fixes, and have much higher confidence before pushing these fixes to production. If you are interested in validating this for yourself, feel free to &lt;a href="https://docs.speedscale.com/guides/replay/guide_other_cluster/" rel="noopener noreferrer"&gt;learn more here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>testing</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Testing LLMs for Performance with Service Mocking</title>
      <dc:creator>Ken Ahrens</dc:creator>
      <pubDate>Tue, 26 Mar 2024 22:15:12 +0000</pubDate>
      <link>https://dev.to/kenahrens/testing-llms-for-performance-with-service-mocking-4ki6</link>
      <guid>https://dev.to/kenahrens/testing-llms-for-performance-with-service-mocking-4ki6</guid>
      <description>&lt;p&gt;While incredibly powerful, one of the challenges when building an LLM application (large language model) is dealing with performance implications. However one of the first challenges you'll face when testing LLMs is that there are many evaluation metrics. For simplicity let's take a look at this through a few different test cases for testing LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability Benchmarks&lt;/strong&gt; - how well can the model answer prompts?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training&lt;/strong&gt; - what are the costs and time required to train and fine-tune models?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency and Throughput&lt;/strong&gt; - how fast will the model respond in production?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A majority of the software engineering blogs you’ll find on LLM testing cover capabilities and training. In reality, though, those are edge cases: you'll likely call a third-party API to get a response, and it's that vendor's job to handle capabilities and training. What you're left with is performance testing, figuring out how to improve latency and throughput, which is the focus of the rest of this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Capability Benchmarks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here is a recent benchmark suite from Anthropic comparing the Claude models with generative AI models from OpenAI and Google. Capability benchmarks like these help you understand how accurate the responses are at tasks like solving a math problem or generating code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pikbjq3y2nqkyhee0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pikbjq3y2nqkyhee0d.png" alt="Claude benchmarks Anthropic" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;https://www.anthropic.com/news/claude-3-family&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blog is incredibly compelling, but it's all functional testing: there is little consideration of performance, such as expected latency or throughput. The phrase "real-time" is used, yet no specific latency is measured. The rest of this post covers techniques for getting visibility into latency and throughput, and ways to validate how your code will perform against real model behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Model Training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you search for information about LLMs, much of the content is about getting access to GPUs so you can train your own models. Thankfully, so much effort and capital have already gone into model training that most "AI applications" can use existing, well-trained models. Your application may be able to take an existing model and simply fine-tune it on your own proprietary data. For the purposes of this post, we'll assume your model is already trained and you're ready to deploy it to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Latency, Throughput and SRE Golden Signals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To understand how well your application can scale, focus on the SRE golden signals as established in the &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals"&gt;Google SRE Handbook&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; is the response time of your application, usually expressed in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; is how many transactions per second or minute your application can handle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; are usually measured as the percentage of requests that fail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt; is how much of the available CPU and memory your application is consuming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before you put this LLM into production, you want to get a sense of how your application will perform under load. This starts with getting visibility into the specific endpoints and then driving load through the system.&lt;/p&gt;
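&lt;p&gt;As a rough sketch (with made-up request records), here is what computing the first three signals from captured per-request logs can look like; saturation comes from cluster metrics rather than request logs:&lt;/p&gt;

```python
# Hypothetical per-request records captured at the service edge:
# (timestamp in seconds, latency in milliseconds, HTTP status code)
requests = [
    (0.0, 120, 200), (0.5, 95, 200), (1.0, 310, 200),
    (1.5, 88, 500), (2.0, 150, 200), (2.5, 4000, 200),
]

window_s = requests[-1][0] - requests[0][0]
latencies = sorted(r[1] for r in requests)

# Latency: the median response time in milliseconds
median_latency = latencies[len(latencies) // 2]
# Throughput: requests handled per second over the capture window
throughput = len(requests) / window_s
# Errors: fraction of responses that came back as 5xx
error_rate = sum(1 for r in requests if r[2] >= 500) / len(requests)
# Saturation (CPU/memory headroom) comes from cluster metrics,
# not request logs, so it is not computed here.

print(median_latency, throughput, round(error_rate, 3))
```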

&lt;h2&gt;
  
  
  &lt;strong&gt;Basic Demo App&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For the purposes of this blog, I threw together a quick demo app that uses OpenAI chat completion and image generation models. These have been incorporated into a demo website to add a little character and fun to an otherwise bland admin console.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chat Completion Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This welcome message uses some prompt engineering with the OpenAI chat completion API to welcome new users. Because this call happens on the home page, it needs to have low latency performance to enable quick user feedback:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpltazwy91vclse97kre1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpltazwy91vclse97kre1.png" alt="Chat welcome message" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Image Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To spice things up a little, the app also lets users generate example images for their profile. Image generation is one of the most impressive capabilities of these models, but you'll quickly see the calls are much more expensive and take far longer to respond. You certainly can't put this kind of call on the home page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgkeu3cimi6dvoxet7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgkeu3cimi6dvoxet7c.png" alt="unicorn ai image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of an image generated by DALL-E 2 of a unicorn climbing a mountain and jumping onto a rainbow. You're welcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Validating Application Signals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we have our LLM selected and a demo application, we want to get an idea of how it scales using the SRE golden signals. To do this, I turned to a product called &lt;a href="https://speedscale.com/"&gt;Speedscale&lt;/a&gt;, which lets me listen to Kubernetes traffic and modify/replay it in dev environments, so I can simulate different conditions at will. The first step is to install a &lt;a href="https://docs.speedscale.com/setup/sidecar/install/"&gt;Speedscale sidecar&lt;/a&gt; to capture the API interactions flowing into and out of my user microservice. This lets us start confirming how well the application will scale once it hits a production environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Measuring LLM Latency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the demo app instrumented, we can start understanding the latency of the OpenAI calls within an interactive web application. In the Speedscale Traffic Viewer, you can see at a glance the response times of the two critical inbound service calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Welcome&lt;/strong&gt; endpoint responds in about 1.5 seconds&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Image&lt;/strong&gt; endpoint takes nearly 10 seconds to respond&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhm34amgt9ywh0ebsvafe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhm34amgt9ywh0ebsvafe.png" alt="speedscale llm transaction latency" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Always compare these response times to your application scenarios. While the image call is fairly slow, it's not made on the home page, so it may not be critical to overall application performance. The welcome chat, however, takes over a second to respond, so you should ensure the webpage does not wait for this response before loading.&lt;/p&gt;
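&lt;p&gt;One common pattern, sketched below with stand-in functions rather than the demo app's actual code, is to fetch the welcome message with a short timeout and fall back to canned text so the page never blocks on the LLM:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

FALLBACK = "Welcome back!"
pool = ThreadPoolExecutor(max_workers=2)

def fetch_welcome():
    # Stand-in for the chat completion call we measured at ~1.5 s.
    time.sleep(1.5)
    return "A clever personalized greeting"

def welcome_with_timeout(timeout_s):
    # Serve canned text if the LLM is slower than the page budget.
    future = pool.submit(fetch_welcome)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return FALLBACK

print(welcome_with_timeout(0.2))   # too slow for the budget: falls back
print(welcome_with_timeout(5.0))   # plenty of time: real greeting
```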

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing LLM Latency to Total Latency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By drilling down further into each call, you'll find that about 85-90% of the time is spent waiting on the LLM to respond. That's with the standard out-of-the-box model and no additional fine-tuning. It's fairly well known that fine-tuning your model can improve response quality, but it tends to increase latency and often costs a lot more as well. If you do a lot of fine-tuning, these validation steps are even more critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Validating Responses to Understand Error Rate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The next challenge you may run into is that you want to test your own code and the way it interacts with the external system. By generating a snapshot of traffic, you can replay and compare how the application responds compared with what is expected. It's not a surprise to see that each time the LLM is called, it responds with slightly different data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl5i0x7y93h8xlgicw1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl5i0x7y93h8xlgicw1v.png" alt="llm response variation" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While dynamic responses are incredibly powerful, this is a useful reminder that an LLM is not designed to be deterministic. If your team runs a continuous integration/continuous deployment pipeline, you'll want some way to make the responses consistent for a given input. This is one of &lt;a href="https://docs.speedscale.com/concepts/service_mocking/"&gt;Service Mocking&lt;/a&gt;'s best use cases.&lt;/p&gt;
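&lt;p&gt;The core idea can be sketched in a few lines (hand-rolled here for illustration; Speedscale derives this pairing from captured traffic): key each recorded response on a hash of the canonicalized request, so the same input always produces the same output:&lt;/p&gt;

```python
import hashlib
import json

RECORDED = {}  # recorded request -> response pairs

def request_key(payload):
    # Canonicalize so key order in the request dict doesn't matter.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def record(payload, response):
    RECORDED[request_key(payload)] = response

def mock_chat_completion(payload):
    # Same input, same output, every run: exactly what CI needs.
    return RECORDED[request_key(payload)]

record({"model": "gpt-3.5-turbo", "prompt": "welcome"}, "Welcome, friend!")
print(mock_chat_completion({"prompt": "welcome", "model": "gpt-3.5-turbo"}))
```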

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing Your Throughput to Rate Limits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After running just 5 virtual users through the application, I was surprised to see the failure rate spike due to rate limits. While rate limiting is helpful in that it keeps you from inadvertently running up your bill, it has the side effect that you can't learn anything about the performance of your own code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3gyifumg47hawm6fqv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3gyifumg47hawm6fqv6.png" alt="speedscale catching llm rate limit error" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is another good reason to implement a service mock: you can run load tests without your bill spiking off the charts the way it would when testing against the real API.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing Rate Limits to Expected Load&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You should be able to plan out which API calls are made on which pages and compare against the expected rate limits. You can confirm your account’s rate limits in the &lt;a href="https://platform.openai.com/docs/guides/rate-limits"&gt;OpenAI docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqo1i5wtr1bmf9j7h65i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqo1i5wtr1bmf9j7h65i.png" alt="chat tpm limits" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, OpenAI will let you pay to increase these limits. However, running even a handful of tests multiple times can quickly push a bill into the thousands of dollars. And remember, this is just non-prod. What you should do instead is create service mocks and isolate your code from the LLM.&lt;/p&gt;
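&lt;p&gt;A quick back-of-the-envelope check helps you spot this before load testing; all numbers below are hypothetical, so substitute your account's real limits:&lt;/p&gt;

```python
# Hypothetical account limits; substitute your own from the API docs.
TPM_LIMIT = 60_000        # tokens per minute
RPM_LIMIT = 500           # requests per minute

tokens_per_call = 400     # prompt plus completion for one welcome message
views_per_minute = 200    # expected home-page views
calls_per_view = 1        # one welcome call per view

calls_per_min = views_per_minute * calls_per_view
tokens_per_min = calls_per_min * tokens_per_call

print("within RPM limit:", calls_per_min <= RPM_LIMIT)
print("within TPM limit:", tokens_per_min <= TPM_LIMIT)
```

With these made-up numbers the request rate fits comfortably, but the token budget is blown well before peak load, which is exactly the kind of surprise you want to find on paper rather than in production.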

&lt;h2&gt;
  
  
  &lt;strong&gt;Mocking the LLM Backend&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Because the Speedscale sidecar automatically captures both inbound and outbound traffic, the outbound data can be turned into service mocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Building a Service Mock&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Find the traffic showing the inbound and outbound calls you're interested in and simply hit the Save button. Within a few seconds you will have generated a suite of tests and backend mocks without ever writing a script.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgexg6k3kglrisgpcfwqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgexg6k3kglrisgpcfwqm.png" alt="speedscale traffic viewer" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Replaying a Service Mock&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speedscale has built-in support for mocking downstream backend systems. When you're ready to replay the traffic, you simply check the box for the traffic you'd like to mock. There is no scripting or coding involved; the data and latency characteristics you recorded are replayed automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2onfbtni0b930nltgejg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2onfbtni0b930nltgejg.png" alt="speedscale service mocking" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using service mocks lets you decouple your application code from the downstream LLM and helps you understand the throughput your application can handle. As an added bonus, you can exercise the service mock as much as you want without hitting a rate limit or paying a per-transaction cost.&lt;/p&gt;
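&lt;p&gt;To make the idea concrete, here is a toy stand-in (not Speedscale's implementation) showing how a mock can replay both the recorded body and the recorded latency:&lt;/p&gt;

```python
import time

# Each mock entry pairs the canned response with the latency
# observed when the traffic was captured (values made up here).
MOCKS = {
    "/v1/chat/completions": {"latency_s": 0.3, "body": {"text": "Welcome!"}},
}

def mock_call(path):
    entry = MOCKS[path]
    time.sleep(entry["latency_s"])   # reproduce the backend's timing
    return entry["body"]

start = time.monotonic()
body = mock_call("/v1/chat/completions")
elapsed = time.monotonic() - start
print(body, round(elapsed, 2))
```

Replaying the recorded latency, not just the body, matters: a mock that answers instantly would make your load test results look better than production ever will.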

&lt;h3&gt;
  
  
  &lt;strong&gt;Confirming Service Mock Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can see all the mocked out calls at a glance on the mock tab of the test report. This is a helpful way to confirm that you’ve isolated your code from external systems which may be adding too much variability to your scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz2x926x9eu4gnb42ala.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz2x926x9eu4gnb42ala.png" alt="speedscale endpoints" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You usually want a 100% match rate on the mock responses, but if something is not matching as expected, drill into the specific call to see why. There is a rich &lt;a href="https://docs.speedscale.com/concepts/transforms/"&gt;transform system&lt;/a&gt; for customizing how traffic is matched and ensuring the mock returns the correct response.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Running Load&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that you have your environment running with service mocks, you want to crank up the load to get an understanding of just how much traffic your system can handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Test Config&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the traffic is ready, you can customize how many copies you’ll run and how quickly by customizing your &lt;a href="https://docs.speedscale.com/concepts/test_config/"&gt;Test Config&lt;/a&gt;. It’s easy to ramp up the users or set a target throughput goal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyhwk5qajnjr01bx7yee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyhwk5qajnjr01bx7yee.png" alt="speedscale replay config" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where you should experiment with a wide variety of settings. Start with the number of users you expect so you know how many replicas you should run. Then crank the load up another 2-3x to see if the system can handle the additional stress.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Test Execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Running the scenario is as easy as combining your workload, your snapshot of traffic and the specific test config. The more experiments you run, the more likely you are to get a deep understanding of your latency profile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmln6cs7zfvaccxfmvik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmln6cs7zfvaccxfmvik.png" alt="speedscale execution summary" width="698" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scenarios should build upon each other. Start with a small run and your baseline settings to ensure the error rate is within bounds. Before you know it, you'll start to see the breaking points of the application.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Change Application Settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You're not limited to changing your load test configuration; you should also experiment with different memory, CPU, replica, or node configurations to squeeze out extra performance. Make sure you track each change over time so you can find the ideal configuration for your production environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7l2jebcmfxi4micb9ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7l2jebcmfxi4micb9ld.png" alt="speedscale performance reports" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case, one simple change was to expand the number of replicas, which cut the error rate way down. The system could handle significantly more users, and the error rate stayed within my goal range.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Sprinkle in some Chaos&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have a good understanding of the latency and throughput characteristics, you may want to &lt;a href="https://docs.speedscale.com/concepts/chaos/"&gt;inject some chaos&lt;/a&gt; into the responses to see how the application performs. By making the LLM return errors or stop responding altogether, you can find code paths that fall down under failure conditions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v1m5vowruar8j70t03z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v1m5vowruar8j70t03z.png" alt="speedscale chaos configuration" width="790" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While chaos engineering the edge cases is pretty fun, be sure to check the results without any chaotic responses first, to confirm the application scales under ideal conditions.&lt;/p&gt;
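&lt;p&gt;The idea can be illustrated with a tiny hand-rolled wrapper (hypothetical, independent of Speedscale's chaos features) that makes a configurable fraction of mock calls fail:&lt;/p&gt;

```python
import random

def chaotic_mock(payload, error_rate=0.3, rng=random.Random(42)):
    # With probability error_rate, simulate the LLM failing outright;
    # otherwise return the normal canned response.
    if rng.random() < error_rate:
        return {"status": 503, "error": "upstream unavailable"}
    return {"status": 200, "text": "Welcome!"}

results = [chaotic_mock({"prompt": "hi"}) for _ in range(100)]
failures = sum(1 for r in results if r["status"] == 503)
print(failures, "failures out of 100 calls")
```

Driving your load test against a wrapper like this quickly shows whether the application retries, degrades gracefully, or simply falls over when the LLM misbehaves.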

&lt;h3&gt;
  
  
  &lt;strong&gt;Reporting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you’re running a variety of scenarios through your application, you’ll start to get a good understanding of how things are scaling out. What kind of throughput can your application handle? How do the various endpoints scale out under additional load?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fki9otjnw8gqodsmrp2s9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fki9otjnw8gqodsmrp2s9.png" alt="speedscale performance metrics" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a glance this view gives a good indication of the golden signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; overall was 1.3s, but it spiked to 30s in the middle of the run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; was unable to scale out consistently and even dropped to 0 at one point&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; were under 1%, which is really good; just a few of the calls timed out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt; of memory and CPU was healthy; the app did not become resource-constrained&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Percentiles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can dig in even further by looking at response time percentiles per endpoint to see what the typical user experience was like. For the image endpoint, a P95 of 8 seconds means that 95% of users had a response time of 8 seconds or less, which really isn't great. Even though the average was 6.5 seconds, plenty of users experienced timeouts, so there are still some kinks to work out of this application's image handling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxhcjoghy0va5glfz3w9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxhcjoghy0va5glfz3w9.png" alt="speedscale latency summary" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;
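&lt;p&gt;Here is the arithmetic on a small, made-up latency sample (not the data from the report above), showing how a few timeouts drag the mean and P95 far above the median:&lt;/p&gt;

```python
def percentile(samples, pct):
    # Nearest-rank percentile: the value at or below which
    # pct percent of the sorted samples fall.
    ordered = sorted(samples)
    rank = int(round(pct / 100 * len(ordered)))
    return ordered[max(rank - 1, 0)]

# Hypothetical image-endpoint latencies in seconds; the 30 s entry
# represents a request that hit the timeout.
latencies = [5.0, 5.5, 6.0, 6.0, 6.5, 6.5, 7.0, 7.5, 8.0, 30.0]

print("mean:", sum(latencies) / len(latencies))  # skewed by the timeout
print("p50: ", percentile(latencies, 50))
print("p95: ", percentile(latencies, 95))
```

A single timed-out request pulls the mean well above the median and pins the P95 at the timeout value, which is why percentiles tell you more about user experience than averages do.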

&lt;p&gt;For even deeper visibility into the response time characteristics you can incorporate an APM (Application Performance Management) solution to understand how to improve the code. However in our case we already know most of the time is spent waiting for the LLM to respond with its clever answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While large language models can bring an enormous boost to your application's functionality, you need to ensure that your service doesn't fall down under the additional load. It's important to run latency performance profiling in addition to looking at model capabilities, and to avoid breaking the bank by calling LLMs from your continuous integration/continuous deployment pipeline. While it can be tempting to run a model that is incredibly smart with its answers, consider the tradeoff of a simpler model that responds more quickly, so users stay in your app instead of closing the browser window. If you'd like to learn more, you can check out a video walkthrough of this post in &lt;a href="https://youtu.be/VR6IPJOQPbE?si=oiwANXKqzpXguJrc"&gt;more detail here&lt;/a&gt;. If you want to dig into LLM performance, feel free to join the &lt;a href="https://speedscale.com/community/"&gt;Speedscale Community&lt;/a&gt; and reach out; we'd love to hear from you.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>servicemocking</category>
      <category>performancetesting</category>
    </item>
    <item>
      <title>APIs for Beginners</title>
      <dc:creator>Ken Ahrens</dc:creator>
      <pubDate>Thu, 06 Jan 2022 13:28:25 +0000</pubDate>
      <link>https://dev.to/kenahrens/apis-for-beginners-50h9</link>
      <guid>https://dev.to/kenahrens/apis-for-beginners-50h9</guid>
      <description>&lt;p&gt;Are you looking to benefit from automation but lack the experience to leverage an API? To equip you with the tools you need to start utilizing APIs and automation, we’ve put together these helpful Beginner FAQs covering common terminology, methods, and tools for testing APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an API?
&lt;/h2&gt;

&lt;p&gt;API stands for Application Programming Interface. An API is a set of programming code that enables data transmission between one software product and another.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does an API Work?
&lt;/h2&gt;

&lt;p&gt;APIs sit between an application and the web server, acting as an intermediary layer that processes data transfer between systems. Here’s how an API works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A client application initiates an API call to retrieve information—also known as a request. This request is processed from an application to the web server via the API’s Uniform Resource Identifier (URI) and includes a request verb, headers, and sometimes, a request body.&lt;/li&gt;
&lt;li&gt;After receiving a valid request, the API makes a call to the external program or web server.&lt;/li&gt;
&lt;li&gt;The server sends a response to the API with the requested information.&lt;/li&gt;
&lt;li&gt;The API transfers the data to the initial requesting application.&lt;/li&gt;
&lt;/ol&gt;
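&lt;p&gt;Here is a toy walk-through of those four steps. Real APIs speak HTTP over a network; each layer is a plain function here so the flow is easy to follow, and all names are invented:&lt;/p&gt;

```python
DATABASE = {"users/42": {"name": "Ada"}}   # the web server's data

def web_server(resource):
    # Step 3: the server responds with the requested information.
    return DATABASE.get(resource, {"error": "not found"})

def api(verb, uri, headers=None, body=None):
    # Step 2: the API validates the request, then calls the server.
    assert verb in {"GET", "POST", "PUT", "PATCH", "DELETE"}
    resource = uri.lstrip("/")
    # Step 4: the API transfers the data back to the caller.
    return web_server(resource)

# Step 1: the client application initiates the request.
print(api("GET", "/users/42", headers={"Accept": "application/json"}))
```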

&lt;h2&gt;
  
  
  What is API Testing?
&lt;/h2&gt;

&lt;p&gt;While there are many aspects of API testing, it generally consists of making requests to a single or sometimes multiple API endpoints and validating the response. The purpose of API testing is to determine if the API meets expectations for functionality, performance, and security.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the most popular kind of API?
&lt;/h2&gt;

&lt;p&gt;The most widely used type of API is the RESTful API (Representational State Transfer API). RESTful APIs allow for interoperability between different types of applications and devices on the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is REST?
&lt;/h2&gt;

&lt;p&gt;Representational State Transfer (REST) is a software architectural style that developers apply to web APIs. REST relies on HTTP: the client sends a request to a URL, and the server returns the specified data, called a ‘resource’. Resources can take many forms (images, text, data). At a basic level, REST is a request-and-response model for APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a REST API?
&lt;/h2&gt;

&lt;p&gt;A REST API conforms to the design principles of REST, the representational state transfer architectural style. RESTful APIs are simple to build and scale compared to other types of APIs, and they help facilitate client-server communication with ease. Because RESTful APIs are simple, they can be the perfect APIs for beginners.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is REST API Testing?
&lt;/h2&gt;

&lt;p&gt;REST API testing is a web automation technique for testing REST-based APIs without going through the user interface. The purpose is to check whether a REST API is working correctly by sending various HTTP requests and validating the responses. You can test a REST API with the GET, POST, PUT, PATCH, and DELETE methods.&lt;/p&gt;
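&lt;p&gt;A sketch of what such a test looks like in plain Python; the fake_api function stands in for a real HTTP client call, and the /users endpoint and its fields are invented for illustration:&lt;/p&gt;

```python
# In-memory stand-in for a REST service, so the test is self-contained.
STORE = {}

def fake_api(method, path, body=None):
    if method == "POST":
        STORE[path] = body
        return 201, body
    if method == "GET":
        return (200, STORE[path]) if path in STORE else (404, None)
    if method == "DELETE":
        STORE.pop(path, None)
        return 204, None
    return 405, None

# The test itself: send requests, then validate each response.
status, _ = fake_api("POST", "/users/1", {"name": "Ada"})
assert status == 201
status, user = fake_api("GET", "/users/1")
assert status == 200 and user["name"] == "Ada"
status, _ = fake_api("DELETE", "/users/1")
assert status == 204
status, _ = fake_api("GET", "/users/1")
assert status == 404
print("all response checks passed")
```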

&lt;h2&gt;
  
  
  What is the most Popular Response Data Format?
&lt;/h2&gt;

&lt;p&gt;JSON is the most popular response data format among developers. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write and simple for machines to parse and generate. Plus, JSON is a text format that is completely language independent but uses conventions familiar to programmers of the C family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. JSON is widely used due to its lighter payloads, greater readability, reduced serialization/deserialization overhead, and easy consumption by JavaScript. These properties make JSON an ideal data-interchange language.&lt;/p&gt;
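&lt;p&gt;A quick illustration of working with a JSON response body (the fields are invented):&lt;/p&gt;

```python
import json

# A JSON response body like an API might return.
raw = '{"user": {"id": 42, "name": "Ada"}, "tags": ["admin", "beta"]}'

data = json.loads(raw)                   # parse text into native objects
print(data["user"]["name"])

data["tags"].append("new")
print(json.dumps(data, sort_keys=True))  # serialize back to text
```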

&lt;h2&gt;
  
  
  How Can I Improve My API Testing &amp;amp; Performance?
&lt;/h2&gt;

&lt;p&gt;Speedscale helps operations teams prevent costly incidents by validating how new code will perform under production-like workload conditions. Site Reliability Engineers use Speedscale to measure the golden signals of latency, throughput, and errors before code is released. Speedscale Traffic Replay is an alternative to legacy API testing approaches, which take days or weeks to run and do not scale well for modern architectures.&lt;/p&gt;

&lt;p&gt;Now that you know some of the basics of APIs and API testing methods, you’re one step closer to being able to leverage the full power of API automation. &lt;a href="https://speedscale.com/api-testing/"&gt;Learn how Speedscale’s solutions can help improve your API testing &amp;amp; performance&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>api</category>
    </item>
  </channel>
</rss>
