<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Audexum</title>
    <description>The latest articles on DEV Community by Audexum (@audexum).</description>
    <link>https://dev.to/audexum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3954347%2Ffb8fe96e-640c-42ea-a86b-3f3c831e0d9b.png</url>
      <title>DEV Community: Audexum</title>
      <link>https://dev.to/audexum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/audexum"/>
    <language>en</language>
    <item>
      <title>I Built a TTS API That's 4x Cheaper Than ElevenLabs: Here's the Tech Stack and Pricing Math</title>
      <dc:creator>Audexum</dc:creator>
      <pubDate>Wed, 27 May 2026 12:45:50 +0000</pubDate>
      <link>https://dev.to/audexum/i-built-a-tts-api-thats-4x-cheaper-than-elevenlabs-heres-the-tech-stack-and-pricing-math-12d4</link>
      <guid>https://dev.to/audexum/i-built-a-tts-api-thats-4x-cheaper-than-elevenlabs-heres-the-tech-stack-and-pricing-math-12d4</guid>
      <description>&lt;p&gt;Why I Built It&lt;/p&gt;

&lt;p&gt;I needed text-to-speech for my own SaaS projects. I looked at ElevenLabs and OpenAI TTS, ran the numbers, and immediately started looking for&lt;br&gt;
  alternatives. At scale — even modest scale — those prices compound fast.&lt;/p&gt;

&lt;p&gt;ElevenLabs charges around $0.30/1K characters on their lowest paid tier. OpenAI TTS is roughly $0.015/1K characters for the standard model.&lt;br&gt;
  Neither felt justified when I knew the actual cost of running inference on a GPU I already owned.&lt;/p&gt;

&lt;p&gt;So I built Audexum (&lt;a href="https://audexum.com" rel="noopener noreferrer"&gt;https://audexum.com&lt;/a&gt;) — a TTS REST API with 43 voices across 33 languages, priced at what I think the market actually supports&lt;br&gt;
  rather than what VC-backed companies need to charge to cover their runway.&lt;/p&gt;

&lt;p&gt;This is a write-up of how I built it, what surprised me, and what I'd do differently.&lt;/p&gt;




&lt;p&gt;The Tech Stack&lt;/p&gt;

&lt;p&gt;The backend is FastAPI with SQLAlchemy async over PostgreSQL. I chose async SQLAlchemy because the API has a mix of quick metadata queries and&lt;br&gt;
  longer-running inference jobs, and I wanted a single event loop handling both without threads. The async driver (asyncpg under the hood) makes&lt;br&gt;
  connection pooling significantly simpler to reason about.&lt;/p&gt;

&lt;p&gt;Caddy handles TLS termination. I use Cloudflare DNS-01 challenge for certificate issuance, which means port 80 never needs to be open. The server&lt;br&gt;
  sits behind a firewall with only 443 reachable publicly — DNS-01 validation happens entirely through Cloudflare's API. If you're running a home&lt;br&gt;
  server or a VPS with port 80 blocked by your host, this is the correct approach.&lt;/p&gt;

&lt;p&gt;The frontend is React + Vite. Nothing unusual there — it's a documentation site, a dashboard, and a playground. Vite's dev server proxy makes&lt;br&gt;
  local development against a FastAPI backend painless.&lt;/p&gt;

&lt;p&gt;Stripe handles billing. Resend handles transactional email (signup confirmation, API key delivery, usage alerts). Both have decent Python SDKs and&lt;br&gt;
   webhook support that works reliably.&lt;/p&gt;

&lt;p&gt;The TTS model itself is Supertonic-3 running as an ONNX graph on CUDA. The model weights take around 1.2 GB of VRAM, which leaves plenty of headroom on an RTX 3090 (24 GB) for batching and model warm-up. For fixed-architecture models like this one, ONNX inference on GPU is typically faster than PyTorch eager mode, and it sidesteps torch version compatibility headaches in production.&lt;/p&gt;

&lt;p&gt;API authentication uses Bearer tokens with sk_live_ prefixed keys — the same convention most developers already know from Stripe and OpenAI. Less&lt;br&gt;
  cognitive load when integrating.&lt;/p&gt;

&lt;p&gt;Output is WAV. I added AI Act Article 50 compliant watermarking in the WAV metadata — a tamper-evident signal embedded in the file header. It's a&lt;br&gt;
  legal requirement for AI-generated audio in the EU and takes about 15 lines of code to implement correctly.&lt;/p&gt;
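&lt;p&gt;For readers who have never touched WAV metadata, here is a minimal sketch of what tagging a file can look like. It assumes a plain LIST/INFO comment chunk, which is not necessarily the format Audexum uses, and the helper names make_silent_wav and add_info_comment are made up for illustration. Plain metadata like this is easy to strip, so a real tamper-evident scheme would need more than this.&lt;/p&gt;

```python
import io
import wave


def make_silent_wav(num_frames: int = 800, sample_rate: int = 8000) -> bytes:
    """Build a tiny mono 16-bit WAV in memory so the example needs no input file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def add_info_comment(wav_bytes: bytes, comment: str) -> bytes:
    """Append a LIST/INFO chunk with an ICMT (comment) entry and fix the RIFF size."""
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    text = comment.encode("utf-8") + b"\x00"   # null-terminated string
    if len(text) % 2:                          # keep the chunk length even
        text += b"\x00"
    icmt = b"ICMT" + len(text).to_bytes(4, "little") + text
    info = b"INFO" + icmt
    list_chunk = b"LIST" + len(info).to_bytes(4, "little") + info
    body = wav_bytes[8:] + list_chunk          # "WAVE" + original chunks + new chunk
    return b"RIFF" + len(body).to_bytes(4, "little") + body


if __name__ == "__main__":
    tagged = add_info_comment(make_silent_wav(), "AI-generated audio")
    with wave.open(io.BytesIO(tagged), "rb") as r:
        print(r.getnframes())  # audio data is untouched
```

&lt;p&gt;The audio frames are left alone; only a trailing chunk is added and the RIFF size field is rewritten to match.&lt;/p&gt;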




&lt;p&gt;Pricing Math: One RTX 3090 Serving Hundreds of Users&lt;/p&gt;

&lt;p&gt;Here are the actual numbers.&lt;/p&gt;

&lt;p&gt;An RTX 3090 pulls around 350W under full inference load. At €0.12/kWh (European average), that's:&lt;/p&gt;

&lt;p&gt;350 W × 24 h = 8.4 kWh/day, and 8.4 kWh × €0.12/kWh ≈ €1.01/day in electricity&lt;/p&gt;

&lt;p&gt;Supertonic-3 generates roughly 80–120 characters of speech per second on the 3090. Call it 100 chars/sec, the midpoint of that range, as an average across voice types.&lt;/p&gt;

&lt;p&gt;100 chars/sec × 3600 sec/hr × 24 hr = 8,640,000 chars/day theoretical max&lt;/p&gt;

&lt;p&gt;At the Scale plan price (€30 per 2M characters), that theoretical daily throughput is worth:&lt;/p&gt;

&lt;p&gt;8.64M chars / 2M × €30 = €129.60/day in revenue at 100% utilization&lt;/p&gt;

&lt;p&gt;Nobody runs at 100% utilization. Real-world API traffic is bursty. But even at 10% utilization:&lt;/p&gt;

&lt;p&gt;€12.96/day revenue vs €1.01/day electricity cost. Electricity alone breaks even at just under 1% utilization (€1.01 / €129.60 ≈ 0.8%).&lt;/p&gt;

&lt;p&gt;The actual server cost (amortized hardware + hosting) matters more than electricity at this scale. But the point is: a single consumer GPU can&lt;br&gt;
  serve hundreds of paying users on realistic usage patterns.&lt;/p&gt;
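&lt;p&gt;The arithmetic above fits in a few lines of Python. This is only a sketch using the figures quoted in this post (350 W, €0.12/kWh, 100 chars/sec, €30 per 2M characters); the function and constant names are made up for illustration.&lt;/p&gt;

```python
# Back-of-the-envelope GPU economics, using the figures quoted in this post.
POWER_WATTS = 350            # RTX 3090 under full inference load
PRICE_PER_KWH_EUR = 0.12     # electricity price assumed in the post
CHARS_PER_SEC = 100          # assumed average synthesis throughput
SCALE_PRICE_EUR = 30         # Scale plan price
SCALE_CHARS = 2_000_000      # characters included in the Scale plan


def electricity_cost_per_day() -> float:
    kwh_per_day = POWER_WATTS * 24 / 1000
    return kwh_per_day * PRICE_PER_KWH_EUR


def max_chars_per_day() -> int:
    return CHARS_PER_SEC * 3600 * 24


def revenue_per_day(utilization: float) -> float:
    """Revenue at Scale plan pricing if the GPU is busy this fraction of the day."""
    return max_chars_per_day() * utilization / SCALE_CHARS * SCALE_PRICE_EUR


def break_even_utilization() -> float:
    """Fraction of the day the GPU must be busy to cover electricity alone."""
    return electricity_cost_per_day() / revenue_per_day(1.0)


if __name__ == "__main__":
    print(f"electricity: EUR {electricity_cost_per_day():.2f}/day")
    print(f"max throughput: {max_chars_per_day():,} chars/day")
    print(f"revenue at 10%: EUR {revenue_per_day(0.10):.2f}/day")
    print(f"break-even utilization: {break_even_utilization():.2%}")
```

&lt;p&gt;Swap in your own power draw, electricity price, and throughput to see where your break-even lands. Hardware amortization and hosting are deliberately left out here.&lt;/p&gt;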

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Plan&lt;/th&gt;&lt;th&gt;Price&lt;/th&gt;&lt;th&gt;Characters&lt;/th&gt;&lt;th&gt;Per 1M chars&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Free&lt;/td&gt;&lt;td&gt;€0&lt;/td&gt;&lt;td&gt;10K/mo&lt;/td&gt;&lt;td&gt;n/a&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Starter&lt;/td&gt;&lt;td&gt;€4/mo&lt;/td&gt;&lt;td&gt;100K/mo&lt;/td&gt;&lt;td&gt;€40&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pro&lt;/td&gt;&lt;td&gt;€12/mo&lt;/td&gt;&lt;td&gt;500K/mo&lt;/td&gt;&lt;td&gt;€24&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Scale&lt;/td&gt;&lt;td&gt;€30/mo&lt;/td&gt;&lt;td&gt;2M/mo&lt;/td&gt;&lt;td&gt;€15&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;PAYG&lt;/td&gt;&lt;td&gt;€3&lt;/td&gt;&lt;td&gt;1M&lt;/td&gt;&lt;td&gt;€3&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

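&lt;p&gt;For comparison, ElevenLabs' equivalent tier runs ~€60+ for 500K characters. OpenAI TTS is cheaper than ElevenLabs, but at roughly $15 per 1M characters it is only about level with the Scale plan rate here and around 5× the PAYG rate (treating dollars and euros as roughly equal).&lt;/p&gt;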




&lt;p&gt;Three Things That Bit Me&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Passlib + bcrypt 5.x Is Broken — Use bcrypt Directly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started with Passlib for password hashing because it's the standard FastAPI recommendation. It works fine until bcrypt releases a major version.&lt;br&gt;
   Passlib hasn't kept up with bcrypt's API changes, and the result is silent failures or cryptic AttributeError exceptions at runtime.&lt;/p&gt;

&lt;p&gt;The fix: drop Passlib entirely and call bcrypt directly.&lt;/p&gt;

&lt;p&gt;import bcrypt&lt;/p&gt;

&lt;p&gt;def hash_password(password: str) -&amp;gt; str:&lt;br&gt;
      return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()&lt;/p&gt;

&lt;p&gt;def verify_password(plain: str, hashed: str) -&amp;gt; bool:&lt;br&gt;
      return bcrypt.checkpw(plain.encode(), hashed.encode())&lt;/p&gt;

&lt;p&gt;Fewer dependencies, no version mismatch risk, same security. If you're starting a new FastAPI project today, skip Passlib and go straight to&lt;br&gt;
  bcrypt.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Stripe EUR Amounts in Test Mode Round Differently Than You Expect&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stripe stores amounts as integers in the smallest currency unit (cents for EUR). When you create a price object programmatically in test mode,&lt;br&gt;
  floating point creeps in if you're not careful.&lt;/p&gt;

&lt;p&gt;# Wrong: floating point precision issue&lt;br&gt;
  amount_cents = plan_price_eur * 100  # e.g. 19.99 * 100 = 1998.9999999999998&lt;/p&gt;

&lt;p&gt;# Correct&lt;br&gt;
  amount_cents = round(plan_price_eur * 100)&lt;/p&gt;

&lt;p&gt;This showed up for me in test mode. Either way, Stripe expects the amount as an integer number of cents, so always use round() before passing amounts to Stripe.&lt;/p&gt;
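&lt;p&gt;One way to keep floats out of the picture altogether is to parse prices as Decimal and convert to integer cents once. A minimal sketch; eur_to_cents is a hypothetical helper, not part of Stripe's SDK.&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP


def eur_to_cents(price_eur: str) -> int:
    """Convert a price such as "19.99" to integer cents without using floats."""
    cents = (Decimal(price_eur) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


if __name__ == "__main__":
    print(eur_to_cents("19.99"))   # 1999
    print(int(19.99 * 100))        # 1998, because the float product is 1998.9999999999998
```

&lt;p&gt;Passing the price as a string matters: Decimal(19.99) would inherit the float's rounding error, while Decimal("19.99") is exact.&lt;/p&gt;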

&lt;ol start="3"&gt;
&lt;li&gt;asyncio.Semaphore(1) Is the Correct GPU Concurrency Fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Concurrent ONNX inference on a single GPU was not safe in my setup. When two requests hit the inference endpoint simultaneously, the result was a CUDA OOM or a CUDA illegal memory access, both of which crash the process.&lt;/p&gt;

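&lt;p&gt;import asyncio&lt;/p&gt;

&lt;p&gt;# One permit: at most one inference touches the GPU at a time&lt;br&gt;
  gpu_semaphore = asyncio.Semaphore(1)&lt;/p&gt;

&lt;p&gt;async def synthesize(text: str, voice_id: str) -&amp;gt; bytes:&lt;br&gt;
      async with gpu_semaphore:&lt;br&gt;
          # run_in_executor is a method of the event loop; model is the loaded ONNX model&lt;br&gt;
          loop = asyncio.get_running_loop()&lt;br&gt;
          audio = await loop.run_in_executor(None, model.run, text, voice_id)&lt;br&gt;
      return audio&lt;/p&gt;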

&lt;p&gt;Semaphore(1) means only one inference runs at a time. Requests queue behind it. For a single-GPU server this is correct behavior — inference is&lt;br&gt;
  fast enough that queue wait times are low.&lt;/p&gt;
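&lt;p&gt;To see the serialization in action without a GPU, here is a self-contained sketch. FakeModel is a made-up stand-in for the real ONNX session: it sleeps instead of running inference and records how many calls overlap. It uses asyncio.to_thread (Python 3.9+) in place of run_in_executor.&lt;/p&gt;

```python
import asyncio
import threading
import time


class FakeModel:
    """Hypothetical stand-in for the real model: sleeps instead of using a GPU."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.max_active = 0  # highest number of overlapping run() calls seen

    def run(self, text: str, voice_id: str) -> bytes:
        with self._lock:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
        time.sleep(0.01)  # pretend to do inference
        with self._lock:
            self._active -= 1
        return f"{voice_id}:{text}".encode()


async def synthesize(model: FakeModel, gate: asyncio.Semaphore, text: str, voice_id: str) -> bytes:
    async with gate:  # only one inference at a time
        # to_thread keeps the event loop free while the blocking call runs
        return await asyncio.to_thread(model.run, text, voice_id)


async def main() -> int:
    model = FakeModel()
    gate = asyncio.Semaphore(1)
    texts = ["one", "two", "three", "four", "five"]
    results = await asyncio.gather(*(synthesize(model, gate, t, "af_heart") for t in texts))
    assert results[0] == b"af_heart:one"
    return model.max_active


if __name__ == "__main__":
    print(asyncio.run(main()))  # 1: calls never overlapped
```

&lt;p&gt;Five requests are fired at once, but the semaphore lets them through one by one, so the overlap counter never goes above 1.&lt;/p&gt;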




&lt;p&gt;Calling the API&lt;/p&gt;

&lt;p&gt;curl:&lt;/p&gt;

&lt;p&gt;curl -X POST &lt;a href="https://audexum.com/api/tts" rel="noopener noreferrer"&gt;https://audexum.com/api/tts&lt;/a&gt; \&lt;br&gt;
    -H "Authorization: Bearer sk_live_xxx" \&lt;br&gt;
    -H "Content-Type: application/json" \&lt;br&gt;
    -d '{"text": "Hello world", "voice_id": "af_heart", "format": "wav"}' \&lt;br&gt;
    --output output.wav&lt;/p&gt;

&lt;p&gt;Python:&lt;/p&gt;

&lt;p&gt;import requests&lt;/p&gt;

&lt;p&gt;response = requests.post(&lt;br&gt;
      "&lt;a href="https://audexum.com/api/tts" rel="noopener noreferrer"&gt;https://audexum.com/api/tts&lt;/a&gt;",&lt;br&gt;
      headers={"Authorization": "Bearer sk_live_xxx"},&lt;br&gt;
      json={"text": "Hello world", "voice_id": "af_heart", "format": "wav"},&lt;br&gt;
  )&lt;/p&gt;

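&lt;p&gt;# Fail loudly instead of saving an error body as output.wav&lt;br&gt;
  response.raise_for_status()&lt;br&gt;
  with open("output.wav", "wb") as f:&lt;br&gt;
      f.write(response.content)&lt;/p&gt;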

&lt;p&gt;Node.js:&lt;/p&gt;

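&lt;p&gt;// Save as an ES module (e.g. tts.mjs): top-level await and the built-in fetch need Node 18+&lt;br&gt;
  import fs from "node:fs";&lt;/p&gt;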

&lt;p&gt;const response = await fetch("&lt;a href="https://audexum.com/api/tts" rel="noopener noreferrer"&gt;https://audexum.com/api/tts&lt;/a&gt;", {&lt;br&gt;
    method: "POST",&lt;br&gt;
    headers: {&lt;br&gt;
      "Authorization": "Bearer sk_live_xxx",&lt;br&gt;
      "Content-Type": "application/json",&lt;br&gt;
    },&lt;br&gt;
    body: JSON.stringify({ text: "Hello world", voice_id: "af_heart", format: "wav" }),&lt;br&gt;
  });&lt;/p&gt;

&lt;p&gt;fs.writeFileSync("output.wav", Buffer.from(await response.arrayBuffer()));&lt;/p&gt;




&lt;p&gt;What's Next&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming audio output (chunked transfer for low-latency playback)&lt;/li&gt;
&lt;li&gt;MP3 and OGG output formats&lt;/li&gt;
&lt;li&gt;Voice cloning from a reference audio clip&lt;/li&gt;
&lt;li&gt;Batch endpoint — multiple texts in, ZIP of WAV files out&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Try It&lt;/p&gt;

&lt;p&gt;Free tier is 10,000 characters per month, no credit card required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://audexum.com" rel="noopener noreferrer"&gt;https://audexum.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you build something with it, I'm curious what you're using TTS for.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>ai</category>
      <category>voice</category>
    </item>
  </channel>
</rss>
