As a software developer who has struggled to build an active English speaking habit, I’ve long felt frustrated with existing language apps. Most speaking apps are built on a repetitive "tap-to-record" loop: you speak, wait for an audio chunk to upload, let an API parse it, wait, and get feedback. It feels less like a conversation and more like sending voice notes back and forth with a slow robot.
I wanted to build an experience that felt like a real, flowing, live conversation. That meant two major technical requirements:
- Sub-second audio latency so the user can speak and receive an immediate spoken response.
- Interrupt-anytime capability so the user can naturally cut in with a thought when the tutor is speaking.
To solve this, I built Fluentio.app. In this article, I’ll walk through the architecture, the technical choices, and how I used a browser-native WebRTC connection straight to OpenAI’s Realtime API to build a responsive, credit-metered language tutor.
1. High-Level Architecture
Fluentio is structured as a TypeScript monorepo containing the frontend client, backend server, and E2E test suites.
Here is how the components communicate:
The Monorepo Layout
- fluentio-client/: The frontend app, an Angular Single Page Application configured as an installable Progressive Web App (PWA).
- fluentio-server/: The Node.js Express backend serving static assets, SEO-friendly landing pages via the Nunjucks template engine, and managing database integration.
- fluentio-test/: Playwright E2E tests, verifying authenticated user scenarios using isolated test runner routines.
2. The Core Tech Stack
| Layer | Technology | Key Details |
|---|---|---|
| Frontend | Angular, RxJS, Tailwind CSS, daisyUI | Configured as an offline-capable PWA via @angular/service-worker |
| Backend | Express, TypeScript, Nunjucks | High performance, lightweight, serves both public SEO pages and API endpoints |
| Database | PostgreSQL | Namespaced schemas, with tables mapped to strictly typed TypeScript interfaces |
| Authentication | better-auth | Seamless Google and Apple OAuth, and email credentials management |
| Payments | Dodo Payments | Handles custom pay-as-you-go credit billing, invoice sync, and webhooks |
| AI Integration | OpenAI Realtime API, Responses API, Speech (TTS) | Sub-second voice via WebRTC (gpt-realtime-mini) and structured JSON schema generation |
3. Engineering a Realtime WebRTC Voice Loop
In traditional voice-AI apps, the connection flow looks like this:
[Browser Mic] -> Stream chunks -> [Your Server] -> Buffer & Upload -> [Whisper API] -> Text -> [LLM] -> Text -> [TTS API] -> Audio Stream -> [Your Server] -> [Browser]
This round-trip flow is dead on arrival. Even with aggressive streaming, you are looking at 3 to 5 seconds of latency. It completely breaks conversational flow.
To bypass this bottleneck, Fluentio connects the user’s browser directly to OpenAI’s Realtime API using WebRTC. Audio streaming is managed browser-to-OpenAI, meaning the Node.js server is never in the hot path for media transport.
Step 1: Authentication & Credit Verification (Backend)
When a user kicks off an English session, the Angular client hits the Express backend endpoint /api/ai/session. The backend is responsible for verifying the session and minting an ephemeral client secret.
// fluentio-server/src/controllers/ai.controller.ts
static async createSession(req: Request, res: Response, { openai, dodoPayments, pool }) {
  const auth = UserUtil.userAuth(res.locals.session);

  // 1. Verify the user's remaining credit balance via the consumption helper
  const { availableCredits } = await UserUtil.calculateUserConsumption(auth, dodoPayments);
  if (!UserUtil.isAdmin(auth) && availableCredits <= 0) {
    return res.status(402).json({ error: "Insufficient credits" });
  }

  // 2. Build the session instructions (dynamic configuration such as vocabulary, history, proficiency)
  const instructions = agentConfig({ ... });

  // 3. Call OpenAI to mint an ephemeral token (WebRTC session config)
  const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      session: {
        type: "realtime",
        model: "gpt-realtime-mini-2025-12-15",
        instructions,
        audio: {
          input: {
            transcription: { model: "gpt-4o-mini-transcribe-2025-12-15" },
          },
          output: { voice: "alloy" }
        }
      }
    }),
  });

  // Surface upstream failures instead of forwarding an undefined secret
  if (!response.ok) {
    return res.status(502).json({ error: "Could not create realtime session" });
  }
  const data = await response.json();

  // Return only the ephemeral secret; the long-lived API key never leaves the server
  res.json({
    client_secret: {
      value: data.value,
      expires_at: data.expires_at,
    }
  });
}
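Because the endpoint forwards `expires_at` along with the secret, the client can refuse to start a handshake with a key that is about to lapse. The following is a minimal sketch of such a guard, not Fluentio's actual code: `ClientSecret`, `isSecretUsable`, and the five-second margin are hypothetical, and it assumes `expires_at` is a Unix timestamp in seconds.

```typescript
// Hypothetical shape of the payload returned by /api/ai/session.
interface ClientSecret {
  value: string;
  expires_at: number; // assumed: Unix time in seconds
}

// Returns true when the secret is non-empty and still has at least
// `marginSeconds` of lifetime left. `nowMs` is injectable so the check
// can be exercised without depending on the real clock.
function isSecretUsable(
  secret: ClientSecret,
  nowMs: number = Date.now(),
  marginSeconds: number = 5,
): boolean {
  const nowSeconds = Math.floor(nowMs / 1000);
  return secret.value.length > 0 && secret.expires_at - nowSeconds >= marginSeconds;
}
```

If the guard returns false, the client would simply request a fresh secret from the backend before opening the peer connection.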
Step 2: Directly Establishing the RTCPeerConnection (Client)
On receiving the client_secret (ephemeral key), the Angular frontend spins up a standard browser WebRTC connection:
// fluentio-client/src/app/features/chat/chat.service.ts
async initializeVoiceSession(clientSecret: { value: string }) {
  // 1. Set up the WebRTC connection
  this.peerConnection = new RTCPeerConnection();

  // 2. Hook up the output audio element so the tutor's voice plays as soon as a track arrives
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  this.peerConnection.ontrack = (e) => {
    audioEl.srcObject = e.streams[0];
  };

  // 3. Add the microphone stream as a WebRTC track
  const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  this.peerConnection.addTrack(mediaStream.getTracks()[0]);

  // 4. Create a data channel for high-level events (text transcripts, function calls, system events)
  this.dataChannel = this.peerConnection.createDataChannel("oai-events");
  this.setupDataChannelListeners(this.dataChannel);

  // 5. Offer/answer handshake directly with the OpenAI WebRTC gateway
  const offer = await this.peerConnection.createOffer();
  await this.peerConnection.setLocalDescription(offer);

  const tokenValue = clientSecret.value; // Ephemeral key
  const sdpResponse = await fetch("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    body: offer.sdp,
    headers: {
      "Authorization": `Bearer ${tokenValue}`,
      "Content-Type": "application/sdp"
    }
  });

  // Fail loudly if the gateway rejects the offer (for example, an expired key)
  if (!sdpResponse.ok) {
    throw new Error(`Realtime handshake failed with status ${sdpResponse.status}`);
  }

  const answerSdp = await sdpResponse.text();
  const answer = {
    type: "answer" as RTCSdpType,
    sdp: answerSdp
  };
  await this.peerConnection.setRemoteDescription(answer);
}
By connecting the client directly to OpenAI via WebRTC:
- Latency drops to roughly 500-800 ms, which feels close to speaking with a person over a clean VoIP call.
- Interrupt-anytime works out of the box because WebRTC is full-duplex: the microphone track keeps streaming while the tutor is speaking. When OpenAI's server-side voice activity detection (VAD) hears the user, it cancels the in-progress response and starts listening.
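The `setupDataChannelListeners` call in the client code is not shown, so here is a minimal sketch of how messages arriving on the `oai-events` channel could be folded into UI state. `VoiceState`, `initialVoiceState`, and `reduceRealtimeEvent` are hypothetical names, and the event type strings are my assumption based on the Realtime API event reference; check them against the API version you target, since some names changed between the beta and the current release.

```typescript
// Hypothetical UI state derived from events on the "oai-events" data channel.
interface VoiceState {
  userSpeaking: boolean; // true between speech_started and speech_stopped
  tutorText: string;     // running transcript of the tutor's spoken replies
  userTurns: string[];   // completed transcripts of the user's turns
}

const initialVoiceState: VoiceState = { userSpeaking: false, tutorText: "", userTurns: [] };

// Pure reducer: takes the raw string payload of one data channel message
// and returns the next state. Unknown or malformed messages leave the state untouched.
function reduceRealtimeEvent(state: VoiceState, raw: string): VoiceState {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return state;
  }
  if (typeof parsed !== "object" || parsed === null) return state;
  const { type, delta, transcript } = parsed as Record<string, unknown>;

  switch (type) {
    case "input_audio_buffer.speech_started":
      // Server-side VAD heard the user, possibly while the tutor was talking.
      return { ...state, userSpeaking: true };
    case "input_audio_buffer.speech_stopped":
      return { ...state, userSpeaking: false };
    case "response.output_audio_transcript.delta":
      return typeof delta === "string"
        ? { ...state, tutorText: state.tutorText + delta }
        : state;
    case "conversation.item.input_audio_transcription.completed":
      return typeof transcript === "string"
        ? { ...state, userTurns: [...state.userTurns, transcript] }
        : state;
    default:
      return state;
  }
}
```

In the browser this would be wired up with something like `dataChannel.addEventListener("message", (e) => { state = reduceRealtimeEvent(state, e.data); })`. Keeping the reducer pure makes the event handling testable without a live peer connection.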
4. Post-Session Analysis with Structured JSON Schema (Responses API)
Once a voice conversation ends, I want to provide the user with deep language insights. This is a complex task: I need to parse the entire dialogue transcript, highlight spelling, grammar, and pronunciation errors, and suggest better alternatives.
To make sure this data is predictable and easy to display on the frontend UI, I enforce type-safe structured output using OpenAI's Responses API (openai.responses.create) with a defined JSON Schema:
// fluentio-server/src/controllers/ai.controller.ts
const analysisParams = {
  model: "gpt-4o-mini",
  input: [
    { role: "system", content: "You are an English language tutor analyzing a practice transcript..." },
    { role: "user", content: `Transcript:\n${transcript}` }
  ],
  text: {
    format: {
      type: "json_schema",
      name: "session_feedback",
      schema: {
        type: "object",
        properties: {
          corrections: {
            type: "array",
            items: {
              type: "object",
              properties: {
                original: { type: "string" },
                corrected: { type: "string" },
                explanation: { type: "string" }
              },
              required: ["original", "corrected", "explanation"],
              // Strict mode requires additionalProperties: false on every object, nested ones included
              additionalProperties: false
            }
          },
          badPatterns: {
            type: "array",
            items: { type: "string" }
          },
          suggestions: {
            type: "array",
            items: { type: "string" }
          }
        },
        required: ["corrections", "badPatterns", "suggestions"],
        additionalProperties: false
      },
      strict: true
    }
  }
};

const feedbackResponse = await openai.responses.create(analysisParams);
// output_text holds the JSON document as a string
const feedbackData = JSON.parse(feedbackResponse.output_text);
In PostgreSQL, namespaced tables mirror the TypeScript interfaces defined in the shared types file. When users open their history dashboard, the client fetches these same clean JSON shapes directly.
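Strict mode constrains what the model emits, but `JSON.parse` still returns an untyped value, so a runtime check before persisting the result is a cheap safety net. The sketch below assumes the shared types look like the schema above; `Correction`, `SessionFeedback`, and `parseSessionFeedback` are hypothetical names rather than Fluentio's actual definitions.

```typescript
// Assumed shared types, mirroring the JSON Schema in the request.
interface Correction {
  original: string;
  corrected: string;
  explanation: string;
}

interface SessionFeedback {
  corrections: Correction[];
  badPatterns: string[];
  suggestions: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function isCorrection(value: unknown): value is Correction {
  if (typeof value !== "object" || value === null) return false;
  const { original, corrected, explanation } = value as Record<string, unknown>;
  return (
    typeof original === "string" &&
    typeof corrected === "string" &&
    typeof explanation === "string"
  );
}

// Parses the model output and returns typed feedback,
// or null when the text is not valid JSON or does not match the expected shape.
function parseSessionFeedback(json: string): SessionFeedback | null {
  let data: unknown;
  try {
    data = JSON.parse(json);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { corrections, badPatterns, suggestions } = data as Record<string, unknown>;
  if (!Array.isArray(corrections) || !corrections.every(isCorrection)) return null;
  if (!isStringArray(badPatterns) || !isStringArray(suggestions)) return null;
  return { corrections, badPatterns, suggestions };
}
```

Returning `null` instead of throwing lets the caller decide whether to retry the analysis or store the session without feedback.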
5. Lessons Learned & Indie Hacker Decisions
PWA beats Native Apps for Indie Devs
Fluentio was engineered intentionally as a Progressive Web App (PWA). By configuring @angular/service-worker and serving a clean manifest.json, users can "Add to Home Screen" directly from Safari on iOS or Chrome on Android.
This choice saved me weeks of development:
- No native App Store reviews or delays.
- 100% of the code lives in a single TypeScript monorepo.
- I avoid the 15-30% app-store commission by integrating directly with Dodo Payments, which acts as merchant of record and handles global sales tax and card processing at checkout.
Co-Existence of Nunjucks & Angular
A common pitfall of Single Page Applications (SPAs) like Angular is poor Search Engine Optimization (SEO) for marketing pages. To circumvent this, the parent Express app serves static, pre-rendered marketing pages (landing, about, pricing, FAQ) utilizing Nunjucks:
// Server handles landing routes via Nunjucks templates
app.get("/", (req, res) => {
  res.render("index.html");
});

// All routes under /classroom are directed straight to the Angular Single Page App
app.use("/classroom", express.static(clientAssetsPath));
app.get("/classroom/{*splat}", (req, res) => {
  res.sendFile(join(clientAssetsPath, "index.html"));
});
This keeps the landing pages fast to load, fully indexable by search engine crawlers, and equipped with clean social graph tags, while the state-heavy classroom logic stays isolated in the Angular application.
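The split can be read as a small routing decision, sketched below as a pure function. This is an illustration rather than the server's real logic: `classifyPath`, `RouteTarget`, and the marketing URL paths are hypothetical, and testing for a file extension is a simplification of what `express.static` does (it checks whether a file actually exists on disk).

```typescript
type RouteTarget = "nunjucks-page" | "spa-asset" | "spa-shell" | "not-found";

// Assumed URL paths for the marketing pages; the real ones may differ.
const MARKETING_PATHS = new Set<string>(["/", "/about", "/pricing", "/faq"]);

// Decides which part of the stack would answer a request for `pathname`
// (the URL path only, without query string or hash).
function classifyPath(pathname: string): RouteTarget {
  if (MARKETING_PATHS.has(pathname)) return "nunjucks-page";

  if (pathname === "/classroom" || pathname.startsWith("/classroom/")) {
    const lastSegment = pathname.slice(pathname.lastIndexOf("/") + 1);
    // Simplification: a file extension stands in for "a static file was found".
    return /\.[a-z0-9]+$/i.test(lastSegment) ? "spa-asset" : "spa-shell";
  }

  return "not-found";
}
```

Deep links such as `/classroom/history/42` resolve to the SPA shell, which is what lets the Angular router take over after a hard refresh.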
6. Going Live
Fluentio.app is now live. By connecting the browser to OpenAI's endpoints over WebRTC and skipping intermediate streaming servers, I was able to ship a highly interactive language app whose backend carries no media traffic.
If you're an indie builder interested in WebRTC or want to try out the pay-as-you-go voice sessions, feel free to sign up. All new signups receive 30 free conversation credits (around 30 minutes of real-time voice time), no card required.
If you have any questions about the stack, the WebRTC loop, or want to discuss AI voice architectures, drop a comment below!
