The Gap
75% of U.S. Catholics say they never go to confession or go less than once a year. Only 2% confess monthly. Those numbers come from Pew Research and CARA at Georgetown, and they've been trending in one direction for decades.
I built Mass Time, an iOS app with 280,000+ churches indexed, 60 prayers, and daily readings. I've watched the confession problem from the data side for years. People want to go. They feel anxious. They don't know what to say. They haven't been in so long that the guilt of not going becomes the reason they keep not going.
In 2025, the late Pope Francis declared a Jubilee Year and specifically called for increased access to the Sacrament of Reconciliation. The demand is real. The infrastructure isn't.
So I built an AI confession guide. No AI will ever replace a priest. Only a priest can grant absolution. This is a preparation tool. You talk to your phone. It talks back. Seven languages. No keyboard. Just a conversation that helps you organize your thoughts before you walk into that booth.
It uses Amazon Nova 2 Sonic, a bi-directional speech-to-speech model, running on Amazon Bedrock AgentCore Runtime. The bugs I hit building it were the worst kind: silent, invisible, and plausible enough to make you doubt everything except the actual problem.
The Architecture
Simple on paper. Brutal in practice.
Nova 2 Sonic handles the heavy lifting. It takes raw audio in, processes speech bidirectionally, and sends audio back. No separate STT or TTS pipeline. One model, both directions, real-time.
The server is a Python 3.13 container on AgentCore Runtime. The client is native Swift with AVAudioEngine. No web views, no React Native. Seven polyglot voices: English (Matthew), French (Florian), Spanish (Carlos), Italian (Lorenzo), German (Lennart), Portuguese (Leo), and Hindi (Arjun). Users can switch languages mid-conversation and the AI follows.
Region constraint worth knowing: Nova 2 Sonic is only available in us-east-1, us-west-2, and ap-northeast-1. I learned the us-east-2 limitation the hard way when my first deployment returned a model-not-found error.
The Bug That Almost Shipped
Nova 2 Sonic never responded to speech. I tried the Strands Agents SDK. Raw API calls. Different prompts. Different voices. Nothing.
The audio RMS values from the mic looked normal. Waveforms had energy. Everything seemed fine.
One line of Swift was wrong.
```swift
// WRONG: reads pointer-to-pointer memory (garbage audio)
Data(bytes: outBuf.int16ChannelData!, count: ...)

// CORRECT: reads actual audio samples
Data(bytes: outBuf.int16ChannelData![0], count: ...)
```
`int16ChannelData` is `UnsafePointer<UnsafeMutablePointer<Int16>>`. A pointer to an array of channel pointers. Without `[0]`, you're reading the pointer addresses themselves as audio samples.
The RMS looked plausible because memory addresses happen to have high values. The model received garbage that looked like audio in every metric except the one that mattered: it wasn't audio.
I confirmed it by injecting known-good PCM on the server side. Model responded perfectly. Fixed the iOS code. Real mic audio worked.
One missing array subscript. Hours of debugging.
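A sketch of why RMS was such a misleading health check: random bytes (standing in here for pointer addresses misread as samples) produce RMS values in the same range as real speech, so an energy metric can't tell them apart.

```python
import math
import random
import struct

def rms_int16(pcm: bytes) -> float:
    """Root-mean-square of 16-bit little-endian mono PCM."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# Real speech-like signal: a 440 Hz tone at moderate amplitude, 16 kHz mono.
tone = struct.pack("<160h", *(int(8000 * math.sin(2 * math.pi * 440 * i / 16000))
                              for i in range(160)))

# "Garbage" audio: random bytes, a stand-in for pointer addresses
# misread as samples. Not audio, but its RMS looks just as healthy.
random.seed(0)
garbage = bytes(random.randrange(256) for _ in range(320))

print(f"tone RMS:    {rms_int16(tone):.0f}")
print(f"garbage RMS: {rms_int16(garbage):.0f}")
```

Both print four-to-five digit RMS values. Energy checks validate that *something* is in the buffer, not that it's audio.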
Two More Invisible Bugs
The app crashed on every AI response with `_outputFormat.channelCount == buffer.format.channelCount`. `AVAudioPlayerNode` was connected with the mixer's stereo format, but I was scheduling mono buffers. Nova 2 Sonic outputs 24kHz mono PCM. The mixer expects stereo by default.
Fix: connect the player node with an explicit mono format matching the model's output.
```swift
let playFmt = AVAudioFormat(
    commonFormat: .pcmFormatInt16,
    sampleRate: 24000,
    channels: 1,
    interleaved: true
)!
audioEngine?.connect(playerNode!, to: audioEngine!.mainMixerNode, format: playFmt)
```
Key detail that isn't obvious from the docs: Nova 2 Sonic input is 16kHz 16-bit mono PCM, but output is 24kHz 16-bit mono PCM. Different sample rates in each direction.
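The asymmetry matters when sizing buffers: a frame of the same duration has a different byte count in each direction. A quick sanity check, assuming 20 ms frames (the frame duration is my own choice, not from the docs):

```python
# Nova 2 Sonic audio formats: 16 kHz in, 24 kHz out, both 16-bit mono PCM.
BYTES_PER_SAMPLE = 2  # 16-bit samples

def frame_bytes(sample_rate_hz: int, frame_ms: int) -> int:
    """Byte length of one mono PCM frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000 * BYTES_PER_SAMPLE

mic_frame = frame_bytes(16_000, 20)      # frame sent to the model
speaker_frame = frame_bytes(24_000, 20)  # frame received from the model
print(mic_frame, speaker_frame)  # 640 960
```

Reuse one buffer size for both directions and you'll either truncate output audio or pad input audio.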
Then the mic auto-unmute timer never fired. After the AI spoke, the mic stayed muted permanently. No error. No warning. Just silence.
`Timer.scheduledTimer` was called from the WebSocket callback thread. That thread has no `RunLoop`. The timer was created, registered to nothing, and quietly ignored. This is one of those iOS gotchas that experienced developers know and everyone else discovers at 2 AM.
The Cognito Session Policy Trap
This one is invisible and will waste your entire day.
Cognito Identity Pools have two auth flows: Enhanced (default) and Basic (classic). Every tutorial uses Enhanced. It works fine for S3, DynamoDB, Lambda.
It does not work for Amazon Bedrock (currently).
Enhanced flow calls getCredentialsForIdentity, which injects a restrictive session policy that limits credentials to a subset of AWS services. Bedrock is not in that subset. Your IAM role policy can be perfect. You'll still get AccessDeniedException.
The error message says "no session policy allows" but doesn't tell you Cognito is injecting it. You can't see this policy in IAM, CloudTrail, or the Cognito console. It's completely invisible.
Fix: one boolean in CDK.
```typescript
const identityPool = new cognito.CfnIdentityPool(this, 'Pool', {
  allowUnauthenticatedIdentities: true,
  allowClassicFlow: true, // This is the entire fix
});
```
On the client side, the classic flow requires three calls instead of one: getId → getOpenIdToken → STS AssumeRoleWithWebIdentity. The STS call is an unsigned HTTP POST, so you don't even need the STS SDK.
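A server-side Python sketch of how that unsigned STS request can be built by hand (the iOS client does the same over plain HTTP in Swift). The role ARN and session name are placeholders; the first two calls are ordinary Cognito SDK calls, shown in comments.

```python
from urllib.parse import urlencode

# Classic (Basic) Cognito flow — three calls instead of one:
#   1. cognito-identity GetId          -> identityId
#   2. cognito-identity GetOpenIdToken -> OIDC token
#   3. STS AssumeRoleWithWebIdentity   -> temporary AWS credentials
# Steps 1-2 are standard SDK calls. Step 3 is unsigned, so it can be a
# bare HTTP POST — no STS SDK required.

def assume_role_request(role_arn: str, oidc_token: str,
                        session_name: str = "app-session") -> tuple[str, str]:
    """Build the unsigned AssumeRoleWithWebIdentity POST: (url, form body)."""
    body = urlencode({
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,
    })
    return "https://sts.amazonaws.com/", body

url, body = assume_role_request(
    "arn:aws:iam::123456789012:role/AppUnauthRole", "example-oidc-token")
print(url)
```

POST that form body with `Content-Type: application/x-www-form-urlencoded` and parse the credentials out of the XML response.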
One boolean. Hours of debugging. Not documented clearly anywhere. This affects all Bedrock actions, not just Nova 2 Sonic.
Frugal Architecture: 60 Concurrent Sessions, Zero Quota Increases
Werner Vogels talks about the Frugal Architect: cost-aware design as a first-class engineering concern, not an afterthought. This project runs on my personal AWS account. No technical account manager. No support plan for filing quota-increase requests. No need to play my "AWS Hero" card.
Nova 2 Sonic has a default service quota of 20 concurrent InvokeModelWithBidirectionalStream requests per region (check your Service Quotas console under Amazon Bedrock for the exact current value). That's 20 simultaneous confessions per region before you hit throttling.
Instead of requesting a quota increase, I built a queue-based routing system across all three available regions.
```python
REGIONS = ['us-east-1', 'us-west-2', 'ap-northeast-1']
MAX_PER_REGION = 20

def pick_region(counts):
    best = None
    for region in REGIONS:
        c = counts.get(region, 0)
        if c < MAX_PER_REGION and (best is None or c < counts.get(best, 0)):
            best = region
    return best
```
3 regions × 20 concurrent sessions = 60 simultaneous confession sessions. On default quotas. No support ticket. No TAM call.
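Usage is one call per incoming session (the function is restated here so the snippet runs standalone); `None` means every region is at quota and the session should wait in the queue. The hardcoded counts are illustrative; in practice they'd come from a shared store such as DynamoDB atomic counters.

```python
REGIONS = ['us-east-1', 'us-west-2', 'ap-northeast-1']
MAX_PER_REGION = 20

def pick_region(counts):  # restated from above for a standalone snippet
    best = None
    for region in REGIONS:
        c = counts.get(region, 0)
        if c < MAX_PER_REGION and (best is None or c < counts.get(best, 0)):
            best = region
    return best

# Least-loaded region under quota wins; a full region is skipped entirely.
choice = pick_region({'us-east-1': 20, 'us-west-2': 7, 'ap-northeast-1': 3})
print(choice)  # ap-northeast-1

# All three regions saturated: queue the session instead of throttling.
print(pick_region({r: MAX_PER_REGION for r in REGIONS}))  # None
```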
The latency difference between regions is negligible for this use case. A confession guide isn't a trading bot. An extra 80ms of round-trip to Tokyo doesn't matter when the conversation has natural 2-second pauses between turns.
Design around the constraints you have instead of asking for the constraints to be removed.
Nova 2 Sonic Configuration That Matters
Turn detection sensitivity set to LOW for confession preparation. That's roughly a 2-second pause before the model responds. You want thoughtful pauses in this context, not rapid-fire conversation.
```python
await send_evt({"event": {"sessionStart": {
    "inferenceConfiguration": {"maxTokens": 1024, "topP": 0.9, "temperature": 0.7},
    "turnDetectionConfiguration": {"endpointingSensitivity": "LOW"}
}}})
```
Available values: LOW, MEDIUM, HIGH. For most conversational use cases, MEDIUM is fine. For reflective, thoughtful conversations, LOW gives the user space to think.
Connection limit is 8 minutes per WebSocket connection. For longer sessions, AWS provides a session continuation pattern in their samples.
System prompt gotchas worth knowing:
- Without explicit instructions, the model repeats its welcome message on every turn. You need: "FIRST RESPONSE ONLY: Welcome with 'In the name of the Father...' AFTER THE FIRST RESPONSE: Do NOT repeat the welcome ever again."
- Hindi requires the speech instruction appended inline to the system prompt text, not as a separate content block. A separate block caused Hindi to be completely silent. No error. Just silence.
- Nova 2 Sonic sends `{ "interrupted": true }` as JSON text during barge-in. Filter it server-side or your transcript gets polluted with raw JSON.
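For that last gotcha, a minimal server-side filter. It assumes transcript chunks arrive as plain strings; the sample chunks are illustrative.

```python
import json

def is_control_chunk(text: str) -> bool:
    """True for barge-in control messages like {"interrupted": true} that
    arrive as text and would otherwise pollute the transcript."""
    try:
        payload = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False  # ordinary transcript text, keep it
    return isinstance(payload, dict) and "interrupted" in payload

chunks = ['Bless me, Father', '{ "interrupted": true }', 'for I have sinned']
transcript = " ".join(c for c in chunks if not is_control_chunk(c))
print(transcript)  # Bless me, Father for I have sinned
```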
Echo Cancellation: Still an Active Problem
The iPhone speaker plays the AI's voice. The mic picks it up. The model hears itself and responds to its own echo. This is the hardest problem in the entire project.
Four approaches, in order of desperation:
1. Flag-based mute when the AI speaks. Problem: `contentEnd` means the server finished sending, not that the speaker finished playing.
2. Playback completion unmute using `scheduleBuffer` completion handlers. Better timing, still some leakage.
3. Send silence while muted. Zero-filled PCM frames keep the bidirectional stream alive while preventing echo from reaching the model.
4. Prompt-level echo awareness: "You will hear your own voice echoed back. If you hear words identical to what you just said, that is YOUR OWN ECHO. IGNORE IT."
The official AWS Python sample never mutes the mic. Audio flows continuously in both directions. Nova 2 Sonic has built-in turn detection and handles echo internally. That works better on desktop where mic/speaker separation is cleaner. On a phone speaker, it's a different story.
Current approach: silence-based muting + playback completion unmute + prompt-level echo awareness. Works about 90% of the time. The other 10% is still an active problem.
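The silence-based muting is simpler than it sounds: a "silence frame" is just zero-filled PCM at the model's input format. A sketch, with the 20 ms frame duration as my own choice rather than anything mandated by the API:

```python
import base64

INPUT_RATE = 16_000   # Nova 2 Sonic input: 16 kHz, 16-bit mono PCM
FRAME_MS = 20         # frame duration is an implementation choice

def silence_frame(frame_ms: int = FRAME_MS) -> bytes:
    """Zero-filled PCM frame: keeps the bidirectional stream alive while
    the mic is muted, without feeding speaker echo back to the model."""
    samples = INPUT_RATE * frame_ms // 1000
    return b"\x00" * (samples * 2)  # 2 bytes per 16-bit sample

frame = silence_frame()
print(len(frame))  # 640 bytes per 20 ms frame
payload = base64.b64encode(frame).decode()  # as carried over the WebSocket
```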
The Economics
Nova 2 Sonic pricing (per 1K tokens): $0.0034 speech input, $0.0136 speech output. That works out to roughly $0.02/min combined.
A 5-minute session costs about $0.10. Not $1. Not $5. Ten cents.
At $1.99 per session, after Apple's 30% cut ($1.39 net), that's about $1.29 profit per paid session. The margins are real.
The freemium flow: 1 free minute (configurable via DynamoDB, no app update needed), then a paywall. Pay $1.99 to continue for up to 5 minutes. A 30-second grace period after the paywall gives users time to decide. The free minute costs about $0.02 to deliver. Even if nobody converts, the free tier costs almost nothing to run.
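The gating itself can be a pure function of elapsed time and payment state. A sketch mirroring the numbers above; the function name and phase labels are my own, not the app's actual code, and the thresholds would be loaded from DynamoDB rather than hardcoded:

```python
FREE_S = 60       # 1 free minute (DynamoDB-configurable in the real app)
GRACE_S = 30      # time to decide at the paywall, audio still flowing
MAX_PAID_S = 300  # 5-minute cap per paid session

def session_phase(elapsed_s: int, paid: bool) -> str:
    """Which phase a session is in, given elapsed seconds and payment state."""
    if paid:
        return "active" if elapsed_s < MAX_PAID_S else "ended"
    if elapsed_s < FREE_S:
        return "free"
    if elapsed_s < FREE_S + GRACE_S:
        return "grace"   # paywall shown, user deciding
    return "ended"       # unpaid past the grace period

print(session_phase(45, paid=False))   # free
print(session_phase(75, paid=False))   # grace
print(session_phase(120, paid=True))   # active
```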
| Per session | Amount |
|---|---|
| Nova 2 Sonic cost (5 min) | ~$0.10 |
| User pays | $1.99 |
| Apple's 30% cut | -$0.60 |
| Net to developer | $1.39 |
| Profit | ~$1.29 |
Session recordings are AI audio only, stored locally on the user's device. Never on servers. The user can save and share via the standard iOS share sheet.
Total AWS infrastructure cost for the backend? Lambda, DynamoDB, API Gateway, AgentCore Runtime. On a personal account with low traffic, we're talking single-digit dollars per month before any sessions even happen.
What's Next: AgentCore WebRTC
On March 20, AWS announced WebRTC support for AgentCore Runtime. This is a big deal for this project.
Right now, the audio path is: iPhone → WebSocket → AgentCore → Nova 2 Sonic. WebSocket works, but it's a text-based protocol carrying base64-encoded audio. Every audio frame gets encoded, wrapped in JSON, and decoded on the other end.
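The overhead is easy to quantify: base64 alone inflates audio by a third before the JSON envelope is even added. A sketch with an illustrative event shape (the field names below are not the real AgentCore wire format):

```python
import base64
import json

# One 20 ms frame of 16 kHz 16-bit mono PCM: 640 raw bytes.
raw = b"\x00" * 640

# WebSocket path: base64-encode the frame, then wrap it in a JSON event.
envelope = json.dumps({"event": {"audioInput": {
    "content": base64.b64encode(raw).decode()}}})

print(len(raw), len(envelope))
print(f"overhead: {len(envelope) / len(raw):.2f}x")
```

WebRTC carries the 640 bytes as binary media; the WebSocket path pays this encode/decode tax on every frame, in both directions.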
WebRTC is purpose-built for real-time media. Binary audio frames. Built-in echo cancellation at the protocol level. Adaptive bitrate. Jitter buffers. All the things I've been fighting to implement manually in Swift.
The migration path is straightforward since AgentCore Runtime supports both protocols on the same runtime. I can add a WebRTC endpoint alongside the existing WebSocket one and A/B test latency and echo cancellation quality on real devices.
If WebRTC's built-in echo cancellation works well on iPhone speakers, that solves the hardest remaining problem in the entire project. That 10% failure rate on echo could drop to near zero.
What I'd Do Differently
Verify audio end-to-end on day one. Inject a known sine wave, record what the model receives, compare. Would have saved hours of headache.
Start with the classic Cognito flow. Don't even try enhanced flow with Bedrock. It won't work and the error messages won't tell you why.
Build echo cancellation as a first-class feature, not an afterthought. On mobile, this is the hardest problem. Budget time for it.
Use Docker assets (ECR) for AgentCore from the start. Code assets (S3) seem simpler to package but the cold start timeout makes them impractical for anything with dependencies.
Design for multi-region from day one. The frugal routing across three regions was an afterthought that should have been the starting architecture. Default quotas are generous if you think horizontally.
The Point
1.39 billion baptized Catholics worldwide. A sacrament that many want to practice but feel unprepared for.
This isn't about replacing priests. It's about removing the anxiety barrier that keeps people from showing up in the first place. A 5-minute voice conversation that helps you organize your thoughts, in your own language, on your own time. For ten cents of compute.
Building this was fast. The debugging was not. Every major bug was the invisible kind: plausible RMS values hiding garbage audio, silent timers on runloop-less threads, session policies you can't see in any console. The documentation had gaps. Echo cancellation on a phone speaker remains partially unsolved.
Most of the code was written with Kiro CLI. What would have taken weeks of back-and-forth between Swift, Python, and CDK was done in hours.
But someone will open this app before their first confession in 20 years. And they'll feel a little less anxious walking in.
That's worth ten cents. Now go build.
