🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - The Rise of AI-powered Voice Communications in Financial Services (AIM263)
In this video, Christophe Van de Weyer from Vonage and Chao Wang from Amazon discuss AI-powered voice communication in financial services using Amazon Nova Sonic, a speech-to-speech foundation model. They explain how this technology enhances customer experience through 24/7 availability, emotional intelligence, and personalization, while reducing costs and improving security via voice authentication. Use cases include balance inquiries, fraud detection, investment recommendations, and insurance claims processing. A live demo showcases voice authentication and account management. Chao announces Nova Sonic 2 features including Hindi and Portuguese support, polyglot voices, asynchronous tool calling, text input capabilities, and configurable turn-taking behavior.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Persistence of Voice and AI-Powered Communication Benefits in Financial Services
Alright. Why are we talking today about voice? Voice is actually not out of fashion yet. Voice still exists and has been very persistent across different technologies, from fixed telephony to mobile to voice assistants and now AI-powered voice communication. What we're going to do today is really talk about how AI-powered voice communication can improve customer experience, reduce costs, and increase security in financial services. I'm here today with Chao, and I'll let you introduce yourself, Chao.
Sure, so my name is Chao Wang. I'm a Director of Applied Science at Amazon. I've been with Amazon for about 10 years, initially working in Alexa on speech interaction experiences, and now I lead a team in AGI Foundations focused on building the speech-to-speech foundation model. I hope you've all heard about the announcement of the Amazon Nova Sonic launch this morning. I'm really excited to have this opportunity to talk to you about Sonic and also about the collaboration with Vonage to use speech foundation models to build AI agents that can interact with users as naturally and intelligently as human agents.
Thank you. My name is Christophe Van de Weyer. I'm President of the API Business Unit at Vonage, and we are a company responsible for cloud communication, providing a communication platform as a service and network APIs allowing applications to communicate with users. As I said, voice is very persistent and it has been so over time across multiple technologies. It started with fixed telephony where you could dial a human, connecting offices and homes. Then it moved to mobile where you could talk and communicate on the move. Then came voice assistants where you could use your voice to issue commands or ask questions and receive responses. And now we have AI-powered voice where you can have a real interaction with an agent which is multi-turn and which also remembers your previous interactions.
Why is voice persistent? It's because it's fast, faster than dialing. It's situational, available wherever you are, and it supports context, nuance, and personalization. What are the benefits that we are seeing of AI-powered voice when we look at the finance sector? On one hand, there is enhanced customer experience through personalization. The AI agent remembers your past interactions and is available 24/7. You can always call an AI-powered voice agent. It's emotionally intelligent, so it can grasp frustration and show more empathy or adapt the pace and reaction accordingly.
Secondly, it results in efficiency. What types of efficiencies? Well, you can automate repetitive tasks like balance inquiries. It's also scalable in the sense that when you have peaks, for example during tax season, AI agents can absorb that load. And of course there are cost savings from having fewer agents in your call center. But security is also important. Why security? Well, your voice footprint provides a way to authenticate you. And when there is fraudulent activity on your account, a voice agent can call you, 24/7. So there are many security and compliance applications for AI-powered voice, now and in the future. Chao will go into more detail on that. We will see there is an opportunity for much more proactivity and also integration with other channels like text, so that you have a unified customer experience, not only purely voice.
Now, what are actually some of the use cases that it enables when you look at financial services? Actually, there's a multitude of use cases that are possible when we look at banking and financial services.
Customer support covers the very basic and routine tasks: balance inquiries can be handled by agents, which can also give personalized recommendations, for example for investments. Then there is fraud protection, using your voice footprint for authentication. An agent can call you when fraudulent activity on your account has been spotted, and it can also record certain transactions for compliance reasons and know when to do so.
For wealth management and investment, other use cases include proactive recommendations for fund investments and many other engagements. On the insurance side, we could imagine basic claims intake, where the basic information around a claim is captured through AI agents, but also lead qualification: by asking simple questions, an agent could prioritize leads or do a risk assessment of a potential lead. There are a lot of possible use cases, but then comes the question of how we do this technology-wise.
Amazon Nova Sonic: Speech-to-Speech Foundation Model Architecture and Vonage Integration
That's why we're now going to present both what Amazon Nova Sonic enables from a speech-to-speech modeling perspective and how the combination with Vonage's voice infrastructure makes it easy for developers and enterprises to build agents. I want to tell you a little bit more about Nova Sonic. Nova Sonic is a speech-to-speech foundation model for real-time, human-like conversational AI. Let me walk you through this diagram first, because that will give you an idea of why a speech-to-speech foundation model is the future and revolutionizes how we build these kinds of agents.
The first thing when you look at this picture is how simple it is. If you've worked with voice agents for a long time, you'll probably be very familiar with the traditional cascade system architecture where you start with an ASR system that converts speech into text. Then you might have an NLU system that derives meaning out of that text, and you have some kind of dialogue control that figures out how to respond to the customer. Then you have a TTS system that turns the response text into speech.
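The cascade just described can be sketched roughly as below. This is a minimal illustration, not any particular product's pipeline, and the function names are placeholders for whatever ASR, NLU, dialogue-control, and TTS components a traditional stack would wire together.

```python
# Illustrative sketch of the traditional cascade voice-agent pipeline.
# Each function stands in for a separate component; streaming, error
# handling, and state management are omitted.

def transcribe(audio_bytes: bytes) -> str:
    """ASR: convert caller audio into text."""
    ...

def understand(text: str) -> dict:
    """NLU: extract intent and entities from the transcript."""
    ...

def decide(intent: dict, dialogue_state: dict) -> str:
    """Dialogue control: choose the response text."""
    ...

def synthesize(response_text: str) -> bytes:
    """TTS: turn the response text into speech audio."""
    ...

def handle_turn(audio_bytes: bytes, dialogue_state: dict) -> bytes:
    # One user turn flows through four separate systems before the
    # caller hears anything back.
    text = transcribe(audio_bytes)
    intent = understand(text)
    response_text = decide(intent, dialogue_state)
    return synthesize(response_text)
```

Each hop in that chain adds latency and loses information (tone, hesitation, emphasis), which is exactly what the speech-to-speech approach described next avoids.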
With a foundation model, you simply integrate with a bidirectional streaming interface, and this interface handles everything. You have speech coming in, and then inside this model, it's going to do streaming-based processing as soon as the speech is received by the model and the inference engine. It's going to generate a live transcript out of the speech. It can also generate API tool calling commands if the input of the user requires that kind of communication and handling. When the tool call information comes back, the model can generate the response incorporating the information coming back from the tool and from the developer's design of what this agent is supposed to do.
At that point, it's going to generate agent speech. The beauty of this whole pipeline is that we actually have the transcript, the text representation of what the user said, as an intermediate output. This helps with grounding and shows what the agent thinks the user is saying, bringing transparency into how the agent decided to handle the request. The model also generates the speech taking into account all the information available to it, which includes the full dialogue history as well as the user's input speech, so that the agent's speech is not only natural and expressive but also attentive to the context.
This means the agent pays attention to how the user is expressing themselves in addition to the content of what they say. That way, the agent's behavior will be much more empathetic and human-like. With this architecture, you can achieve very fluid dialogue handling and natural turn-taking. What I mean is that as soon as the user finishes speaking, the system determines that the customer is done and starts responding. But if the customer interrupts the agent's response, it stops almost immediately to listen to the customer and then responds to what they said next.
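A minimal sketch of what consuming that bidirectional stream can look like is shown below. The `session`, `player`, and `tools` objects and the event names (`transcript`, `toolUse`, `audioOutput`, `interruption`) are illustrative assumptions, not the exact Amazon Nova Sonic event schema; the point is that a single loop receives live transcripts, tool-call requests, and agent audio, and stops playback the moment an interruption is signalled.

```python
# Hypothetical consumer of a bidirectional speech-to-speech stream.
# Event names and object interfaces are illustrative placeholders.

async def run_conversation(session, player, tools):
    async for event in session.receive():            # model -> application events
        kind = event["type"]
        if kind == "transcript":
            # Live text of what the model heard the user say (grounding).
            print("user said:", event["text"])
        elif kind == "toolUse":
            # The model asks the application to call a tool, e.g. a balance lookup.
            result = await tools.invoke(event["name"], event["arguments"])
            await session.send_tool_result(event["id"], result)
        elif kind == "audioOutput":
            # Agent speech chunks, played back as they arrive.
            await player.play(event["audio"])
        elif kind == "interruption":
            # The user barged in: stop playback immediately and keep listening.
            await player.stop()
```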
We are very proud of our speech recognition performance, and you probably saw in this morning's keynote presentation that we achieve state-of-the-art accuracy in speech recognition. I mentioned that we support adaptive speech responses, meaning the model adapts to how the user is speaking, not only what they are saying. We have very expressive voices, and I hope you'll see that in the demo. The voices are available in multiple languages, and because of this very simple speech-to-speech architecture, it naturally achieves very low response latency.
What we are doing is combining Amazon Nova Sonic technology with the Vonage voice API. We are combining the Amazon Nova Sonic AI capabilities and speech-to-speech models with the voice infrastructure of Vonage. This makes it very easy to do inbound calls and outbound calling. It can be done on multiple technologies, so it can be done on mobile apps, web apps, or traditional calls. The combination of the voice infrastructure together with Nova Sonic makes it easy for enterprises and developers to build agents, and that is what we really try to do through this partnership and combination: to make it easy for developers to build agents for the future.
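As a rough sketch of how the two sides can meet: the Vonage Voice API calls an answer webhook when a call arrives, and the webhook can return an NCCO that streams the call's audio to a WebSocket. The Flask app and the wss:// URI below are placeholders for a backend that would hold the Nova Sonic session; treat this as an assumption-laden sketch rather than a complete integration.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Answer webhook for the Vonage Voice API: Vonage fetches an NCCO from this
# URL when a call comes in. The connect/websocket action streams the call
# audio to our own backend (placeholder URI), which in turn would hold the
# bidirectional Nova Sonic session.

@app.route("/webhooks/answer")
def answer_call():
    ncco = [
        {
            "action": "connect",
            "endpoint": [
                {
                    "type": "websocket",
                    "uri": "wss://example.com/voice-agent",   # placeholder backend endpoint
                    "content-type": "audio/l16;rate=16000",   # raw 16 kHz PCM audio
                }
            ],
        }
    ]
    return jsonify(ncco)

if __name__ == "__main__":
    app.run(port=3000)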
Of course, we want to demo this, because a demo is always better than explaining on slides how it could work. Here is how the demo went:

Agent: Hello, this is your financial services assistant. For your security, please say the phrase "my voice confirms my identity."
User: My voice confirms my identity.
Agent: Authentication is complete and secure. Hello, how can I assist you today?
User: What is my current savings account balance?
Agent: Let me retrieve your current savings account balance for you. Your savings balance is $23,540.15. Let's switch to your investments. Your portfolio is doing quite well this year, with a total value of $512,040.75, which is up by 6.2% year to date.
User: Let's schedule a call with my financial advisor. Can you help me with that?
Agent: Certainly, I can help you schedule a call with your financial advisor. Let me check the availability for next week. It looks like your advisor is available on Tuesday at 10:00 a.m. Would that work for you?
User: Do you have any openings for Wednesday?
Agent: Let me check the availability for Wednesday. It appears that your advisor is available on Wednesday at 2:00 p.m. Would that time work for you?
User: That works.
Agent: Great. I've scheduled a meeting with your financial advisor for Wednesday at 2:00 p.m. You will receive a confirmation email shortly.
Nova Sonic 2 Enhancements: Multilingual Support, Asynchronous Tool Calling, and Configurable Turn-Taking
That is how it works. Now, of course, we have explained a bit about what we have now, but maybe you can tell us a bit about what is next without divulging any secrets. Well, actually it is easy for me because we just announced Nova Sonic 2, which is the next generation of Sonic. Let me give you a sneak peek of the new features that we launched. Sonic in version one supports English, French, Italian, German, and Spanish.
With Sonic 2, we added new languages: Hindi and Portuguese. We have an exciting language roadmap in the pipeline that we hope to share in coming quarters.
We also added what we call polyglot voices, which means the voice can speak multiple languages even though the original voice talent was monolingual. This way we can handle sessions with language switching very seamlessly. A user can literally switch between, let's say, English and Spanish, depending on which language they feel most comfortable expressing themselves in, and the same agent voice can respond in English or Spanish given that context.
We also added a feature called asynchronous tool calling. These models are very fast if you're just doing casual conversation, but sometimes when you go off to do a tool call, that can take a long time. If you're just waiting for the tool call to come back to continue the conversation, you sometimes get stuck in an awkward silence with the user, and sometimes the user wonders what's going on because the system almost feels dead or broken. We enabled this asynchronous tool calling capability where agents can basically continue the conversation with the customer when the tool call is being executed in the background. When the tool call results come back, the agent incorporates that information into the dialogue to continue the conversation with the customer in a very natural way.
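The idea can be sketched with plain asyncio: kick the slow tool call off as a background task, keep the dialogue loop running, and hand the result back to the model whenever it completes. The `session.send_tool_result` call and the event shape are the same illustrative placeholders used earlier, not the actual Sonic 2 API.

```python
import asyncio

# Conceptual sketch of asynchronous tool calling: the slow lookup runs in the
# background while the agent keeps talking, and the result is injected back
# into the dialogue when it arrives. All names are illustrative placeholders.

async def slow_portfolio_lookup(account_id: str) -> dict:
    await asyncio.sleep(5)  # stands in for a slow backend or third-party call
    return {"total_value": 512040.75, "ytd_change": 0.062}

async def handle_tool_use(session, event, background_tasks: set):
    async def run_and_report():
        result = await slow_portfolio_lookup(event["arguments"]["account_id"])
        # The model weaves this result into the ongoing conversation once it arrives.
        await session.send_tool_result(event["id"], result)

    # Fire and forget: the conversation loop is never blocked on the tool call.
    task = asyncio.create_task(run_and_report())
    background_tasks.add(task)                       # keep a reference so the task isn't GC'd
    task.add_done_callback(background_tasks.discard)
```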
Another feature we added is the ability to include text alongside the speech input. So it's not just speech to speech; it's actually speech and text to speech and text. Even though we say these are voice-first experiences, having text capabilities is helpful for many use cases. For example, if you want the agent to say something before the customer says anything, you can use text input from the developer side to prompt the model so the agent starts speaking without waiting for user speech. You can also use that capability to handle keypad input, so that key presses are converted into text as input to the model.
The benefit of this is that all of these are seamlessly incorporated into the model as if it's just one thing. There's never a loss of context or switch of context when you have different modality of input. It's all the same context information available to the model.
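Concretely, that could look like the sketch below: the developer pushes a text event at session start so the agent greets the caller first, and keypad digits are forwarded as text through the same interface. `session.send_text` is an illustrative placeholder for whatever text-input event the streaming interface exposes, not the exact Sonic 2 input schema.

```python
# Illustrative placeholders, as above: `session.send_text` stands in for the
# model's text-input event.

async def start_with_greeting(session):
    # Developer-supplied text so the agent speaks first, before any user audio.
    await session.send_text(
        "Greet the caller as their financial services assistant and ask how you can help."
    )

async def forward_keypad_digits(session, digits: str):
    # DTMF key presses arrive as digits and are forwarded as text into the same
    # session, so the model keeps one shared context across modalities.
    await session.send_text(f"The caller entered the digits {digits} on the keypad.")
```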
One last feature I want to talk about is configurable turn-taking behavior. We all like these agents to be snappy: you talk and then the agent comes back. But there are actually scenarios where users prefer the agent to be a little more relaxed, so they don't feel pressured to keep talking for fear that the moment they pause, the agent will cut them off. You can configure the sensitivity threshold to low, medium, or high, trading off some of that latency for a more relaxed kind of interaction. A rough illustration of such a per-session setting follows below.
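In configuration terms this is just a per-session knob. The snippet below is only a guess at what such a setting might look like; the field names, including "turnTakingSensitivity", are invented for illustration and are not the actual Nova Sonic 2 configuration schema.

```python
# Hypothetical session configuration; field names are invented for illustration.
session_config = {
    "voice": "example-polyglot-voice",   # placeholder voice id
    "language": "en-US",
    "turnTakingSensitivity": "low",      # low = waits longer on pauses before responding
}
```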
I think that's what I can share for now. If you watch our Nova pages, you'll see that we keep launching updates and new features over the course of the year. If you want to learn how to build with Sonic, there are quite a few sessions available. There's a workshop which is amazing: it's two hours long and tells you everything about how to build these kinds of agents. Unfortunately, it's happening right now, so you won't be able to catch it, but these sessions are recorded, so you should be able to look it up by searching for keywords or the session title and watch it at a later time.
There are chalk talks happening in the next couple of days, and there's also a breakout session which is a one-hour long presentation about Sonic. I hope you'll be able to catch that one tomorrow. You're always welcome at the Sonic meeting room, and we also have drinks and a presentation later tonight in Paris. I think some leaflets will be handed out right over there if you're interested to come and learn more. Thank you all for your attention. Thank you so much for attending.
This article is entirely auto-generated using Amazon Bedrock.