Imagine asking your app directly, "What was my biggest expense this month?" and having it instantly reply aloud, scolding you like a ruthless personal financial coach.
Today, I built a Voice Interaction loop into my Serverless Financial Agent. The architecture mixes native web APIs with AWS machine learning services, bound together by strict cost controls.
The Audio Pipeline
For input, I used the browser's built-in SpeechRecognition interface (part of the Web Speech API). It costs nothing to call, transcribes the user's speech in the browser, and the form is submitted automatically once the user stops talking.
For output, the Python Lambda calls Amazon Polly to synthesize the AI's response using high-fidelity Neural voices.
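A minimal sketch of what the Polly call inside the Lambda might look like, using boto3's `synthesize_speech`. The function names, the `Joanna` voice, and the injectable `polly_client` parameter are assumptions for illustration, not taken from the actual project.

```python
import base64


def build_polly_request(text: str, voice_id: str = "Joanna") -> dict:
    """Keyword arguments for Polly's SynthesizeSpeech operation."""
    return {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,  # assumption: any voice that supports the neural engine
        "Engine": "neural",
    }


def synthesize_to_base64(text: str, polly_client=None) -> str:
    """Return the synthesized MP3 as a Base64 string."""
    if polly_client is None:
        import boto3  # imported lazily so the function can be exercised with a stub client

        polly_client = boto3.client("polly")
    response = polly_client.synthesize_speech(**build_polly_request(text))
    audio_bytes = response["AudioStream"].read()  # streaming body holding the MP3 bytes
    return base64.b64encode(audio_bytes).decode("ascii")
```

Passing the client in as a parameter keeps the function testable without AWS credentials; inside Lambda the default path creates a real Polly client.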
FinOps: The Caching Layer
Amazon Polly bills per character synthesized, so costs climb quickly if the endpoint is abused. To limit that, I built a caching layer on a DynamoDB table (FinanceAgent-Cache). Every time text is synthesized, the Base64-encoded MP3 is stored under a SHA-256 hash of the text and language. If the same response needs to be read aloud again, the backend serves the cached audio instead of calling Polly.
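The cache lookup can be sketched roughly as follows. This is an illustration under assumptions: `cache_key` and `get_or_synthesize` are hypothetical names, the exact way text and language are combined before hashing is a guess, and a plain dict stands in for the DynamoDB table so the logic runs locally.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(text: str, language: str) -> str:
    """SHA-256 hex digest over language and text (the separator is an assumption)."""
    payload = f"{language}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def get_or_synthesize(
    text: str,
    language: str,
    cache: MutableMapping[str, str],
    synthesize: Callable[[str, str], str],
) -> str:
    """Return cached Base64 audio if present; otherwise synthesize and store it."""
    key = cache_key(text, language)
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: no Polly call
    audio_b64 = synthesize(text, language)  # cache miss: pay for synthesis once
    cache[key] = audio_b64
    return audio_b64
```

In the real backend the dict would be replaced by reads and writes against the DynamoDB table. Since DynamoDB caps items at 400 KB, very long responses may need a size check before being cached this way.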
Backend Gating & IAM
I didn't just hide the microphone button in the UI. I secured the backend route (?action=synthesize_voice) so that it rejects any request whose Cognito-verified JWT does not carry one of a specific set of premium user emails. Finally, for Polly access, I locked down the Lambda execution role with a single inline policy that allows only polly:SynthesizeSpeech.
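A rough sketch of the gate and the policy document, under several assumptions: the JWT has already been verified upstream and its claims arrive as a dict containing an `email` claim, the allow-list address is a placeholder, and the function names and response bodies are hypothetical.

```python
import json

# Hypothetical allow-list; the real premium emails are not given in the post.
PREMIUM_EMAILS = frozenset({"premium.user@example.com"})

# Inline policy document granting only the single Polly action named above.
POLLY_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "polly:SynthesizeSpeech",
            "Resource": "*",
        }
    ],
}


def is_premium(claims: dict) -> bool:
    """True if the verified token's email claim is on the allow-list."""
    email = str(claims.get("email", "")).strip().lower()
    return email in PREMIUM_EMAILS


def gate_synthesize_voice(action: str, claims: dict) -> dict:
    """Reject non-premium callers with 403 before any Polly call is made."""
    if action != "synthesize_voice":
        return {"statusCode": 400, "body": json.dumps({"error": "unknown action"})}
    if not is_premium(claims):
        return {"statusCode": 403, "body": json.dumps({"error": "premium feature"})}
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

The check runs on claims from a verified token rather than on anything the client sends in the request body, which is what makes it a backend gate and not a UI toggle.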
Voice AI is a fantastic feature, but caching and access control are what make it production-ready.
