Kazuya

Posted on Dec 5, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Nova 2: Enterprise intelligence optimized for the real world (AIM3342)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

In this video, Firat Elbey and Ryan Hoium from Amazon and Abhinay Kathuria from Zendesk introduce Amazon Nova 2, the next generation of foundation models delivering industry-leading price performance. The session covers four new models: Nova 2 Lite (fast, cost-effective reasoning), Nova 2 Pro (most intelligent reasoning model), Nova 2 Omni (unified multimodal reasoning), and Nova Sonic (speech-to-speech for conversational AI). Key capabilities include developer controls with adjustable reasoning levels, multi-step reasoning, native tool use, extended context, and support for up to 200 languages. Nova 2 Lite outperforms the previous flagship Nova Premier while being seven times lower in cost and up to five times faster. Ryan discusses real-world evaluation methodology focusing on classification, document understanding, and agentic workflows. Abhinay shares Zendesk's success using Nova for translations (300 million words annually across 30+ languages) and automated resolutions (5 billion resolutions this year), with Nova 2 showing 5% less hallucination, 2% less language mixing, and 3% less mistranslation. The session also introduces Nova Forge for custom model building and demonstrates practical applications through live demos.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction to Amazon Nova 2 and AWS's AI Innovation Journey

Hey everyone. My name is Firat Elbey. I'm a Principal Product Manager for Amazon Nova, and I'm really excited to talk to you today about Amazon Nova 2. I'm joined here with my colleague Ryan Hoium, who heads the Applied AI Solutions Architecture team at Amazon. And we've got Abhinay Kathuria, who also joins us from Zendesk, and he heads the Machine Learning Platform teams. So we're going to quickly give you an introduction to Amazon Nova 2, and then we're going to dive a bit deeper into some of the capabilities and some of the results that we've seen. Then Ryan is going to go deeper into real-world evaluation and some of the key use cases that we've seen customers have success with. And then Abhinay is going to dive deeper into his success that he's had with Amazon Nova.

Amazon has been shipping AI innovations for decades. And at each critical inflection point, we've been there, building industry-leading technologies for our business and our customers. For example, starting with rule-based and decision systems, we built early recommendation systems. Then with classical machine learning, we did things like inventory forecasting. With early language models, we built many models for products like Alexa, including speech recognition and natural language processing and understanding. Then came foundation models with Amazon Nova, and last but not least, agents with Kiro, Nova Act, and so on. We've really been innovating in this space for the last few decades.

And to bring these innovations to you, we ship at each layer of the stack, from custom compute to turnkey solutions. Starting from the bottom, we've got our compute with AWS Trainium and AWS Inferentia, and then we've got the tools that allow you to train these models and process the data. And then we have the applications, agents, and models like Amazon Bedrock and Amazon Nova, so you can build on top of that. And then we have our vertical, turnkey solutions like Amazon Connect for customer experience, software development with Kiro and AWS DevOps Agent, as well as AWS Security Agent, AWS Transform for migration and modernization, and Amazon Quick Suite for business productivity, as well as the AWS Marketplace.

And then on top of that we've got our specialized expertise. The Gen AI Innovation Center works with you on proof of concepts and really helps you get the most out of your Gen AI applications and the outcomes that you're looking for. And then we have our broader partner network with over 140 partners who also work with customers to enable them to get the best out of AWS. Let's zoom in a bit. We have Amazon Bedrock, a comprehensive service for generative AI application and agent development. Bedrock offers access to leading foundation models and tools to enable you to build your AI applications. You can customize models and applications with your data, apply safety guardrails, optimize cost and latency, and rapidly iterate.

Unveiling the Amazon Nova 2 Model Family: Four New Models for Enhanced Intelligence

Let's jump into Amazon Nova. We've got thousands of customers across many industries using Amazon Nova to build AI solutions today. Since we launched this time last year, we've seen lots of great success stories, and here's a selection of customers that use Amazon Nova today. As we spoke to those customers, they told us several things. First, they wanted more intelligence. They wanted smarter models for more complex tasks. Then they wanted more modalities. There are a lot of different models out there that offer different capabilities, and you have to chain these things together. Customers really didn't like that.

Gen AI is great and solves a lot of problems, but how do you scale it? So they wanted lower costs and faster, scalable performance in the production environment. And for our more advanced customers, they wanted more options for customization. You need to go beyond context engineering and do fine-tuning or even go deeper into model training and the pre-training side of things to get the most out of models and deliver the experience that you're looking for. So I'm really excited to introduce, as Matt mentioned in the keynote, Amazon Nova 2, our next generation models that deliver industry-leading price performance across reasoning, multimodal use cases, as well as conversational AI. Today we announced four additional models to make up the Nova 2 model family.

Starting with Amazon Nova 2 Lite, this is our fast, cost-effective reasoning model for your everyday workloads, your general NLP tasks, your routing needs, and your classification use cases. Then you have Nova 2 Pro, which is our most intelligent reasoning model for those more highly complex tasks. Think agentic coding and multi-step agentic workflows. And then we have Nova 2 Omni, which is the industry's first unified multimodal reasoning model. This model can handle all modalities on the input while also generating text and images in an industry first. And then we have Amazon Nova Sonic, our next generation speech-to-speech model for your real-time conversational AI needs.

Let's dive deeper into Nova 2 Lite and Pro. Here are a few key attributes I want to call out. Nova 2 Lite is generally available today. Nova 2 Pro is in preview, with early access for our Nova Forge customers, which Ryan will cover a little bit later. Both models support long context. We see a lot of applications, especially agentic workflows, where agents use context compression techniques that don't quite keep all the necessary context to make the experience delightful. So we're ensuring that our models support longer context.

These models support up to 200 languages. As I mentioned earlier, Nova 2 Lite and Pro support multiple input modalities, as you can see on the screen. Nova 2 Pro also supports speech as input, in addition to all the modalities that Lite supports. Customization is available for Nova 2 Lite today and is coming soon for Nova 2 Pro. These models represent a significant leap in intelligence over the previous Nova family. For example, Nova 2 Lite outperforms Nova Premier, our previous flagship, on multi-step reasoning and agentic workflows. Nova Premier was our flagship model earlier this year, and Nova 2 Lite, our cost-effective reasoning model, is winning in a number of areas. And it does this while being seven times lower in cost and up to five times faster. So it's quite a significant step from where we were with the original Nova family.

Nova 2 Performance Benchmarks and Developer Controls for Reasoning

The key question is how these models compare to other models in similar intelligence tiers. Here you can see Nova 2 Lite delivers incredible price performance for many workloads we see customers looking to deploy in production today. Nova 2 Lite excels in areas like instruction following, tool calling, generating code, and extracting information from documents, often matching or exceeding the performance of comparable models. It compares favorably in industry benchmarks to Claude Haiku 4.5, GPT-5 Mini, and Gemini 2.5 Flash. And Nova 2 Pro, our most intelligent reasoning model, is great for your highly complex needs, particularly when you look at some of the more important areas like instruction following, extracting information from documents, and tool use. You can see it's one of the leading models when compared to models in a similar intelligence tier. Not only that, these models are incredibly fast and cost effective. Nova 2 Lite is in what Artificial Analysis likes to call the magic quadrant, achieving over 220 output tokens per second. In terms of input-output price, you can see it's far on the left side. So we're not only trying to give you intelligent models, we're trying to make them scalable for you in production.

Next I want to cover some of the key capabilities of these Nova 2 models. These include developer controls, multi-step reasoning, native tool use, built-in tools, and extended context. Let's jump into each one by one. Starting with developer controls, with Nova 2 we wanted to give developers control and the ability to toggle reasoning on or off. You can adjust how much the model thinks based on the task. There's a dial that you can control from low, medium, or high, and you can tune this to a specific use case. This enables you to manage latency and control your costs. You have multiple options: you can turn reasoning off or you can use reasoning at different levels to really tune in what you need to ensure that you get the right price performance for your use case.
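One way to use the reasoning dial is to start at the cheapest setting and step up only when a check fails. The following is a minimal sketch under stated assumptions: the level names come from the talk, but `run_model` and `is_acceptable` are hypothetical caller-supplied functions, and nothing here reflects the actual Nova API parameter names.

```python
from typing import Callable, Optional, Tuple

# Level names as described in the talk; the real API field names are not shown there.
REASONING_LEVELS = ("off", "low", "medium", "high")


def answer_with_escalation(
    run_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
    prompt: str,
) -> Tuple[Optional[str], str]:
    """Try the cheapest reasoning level first and escalate only if the check fails."""
    answer = ""
    for level in REASONING_LEVELS:
        answer = run_model(prompt, level)  # hypothetical wrapper around a model call
        if is_acceptable(answer):
            return level, answer
    # No level passed the check; return the last attempt for inspection.
    return None, answer


# Stand-in for a real model call that mimics the demo in this session:
# the wrong answer with reasoning off, the right answer at "low".
def fake_model(prompt: str, level: str) -> str:
    return "supported" if level == "off" else "not supported"


level, answer = answer_with_escalation(
    fake_model,
    lambda a: a == "not supported",
    "Do these two papers support the hypothesis?",
)
print(level, answer)
```

In production the acceptance check would rarely be a known answer. It is more likely a cheap validator such as a schema check or a consistency rule, so the escalation pattern trades a little extra latency on hard inputs for lower cost on easy ones.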

In this demo I'm going to show you an example. First I'll show you with reasoning off, and then I'll show you with low reasoning. Here, I've actually attached two documents around 70 pages in length in this user interface.

I want to see if the model is able to determine whether these documents support the hypothesis that I'll have on the screen. You can see it's processing the document, and this is the hypothesis I put. I'm asking whether these documents validate or invalidate it. The model says supported, but that's actually the wrong answer. It's actually meant to be not supported because one of the papers actually disagrees with the hypothesis.

Let's take a look at this with a different approach. This time you'll see a toggle on the bottom, and I'll speak about this chat interface later in the session. You can see we've set it to low, and now we're going through the same exact prompt with the same exact documents. This time, the model was able to reason through and actually picked out the missing mechanism at the bottom. It's able to reason through those 70 pages and effectively use this new reasoning extended thinking capability to get the answer right this time.

Key Capabilities: Multi-Step Reasoning, Native Tool Use, and Long Context Processing

Let's move on to multi-step reasoning. We hear this a lot, but it's really critical that we enable our models to take these kinds of complex tasks and break them down. We've really optimized Nova for this kind of capability. Nova is able to look at a problem, reason through it, break it down, and then call the right tools in the right order. We've seen customer support use cases and assistant interactions where you have multiple tasks that the agent needs to do, and it's able to reason through all of that. It's able to detect failures and course correct along the way as well.

In this example, a Nova 2 powered agent updates a repository to enable support for Nova's new extended reasoning capability. This means we're reasoning through, planning, and executing a full workflow, including executing multiple tools. In this situation, we're running an agent using Nova and AgentCore with Strands. Strands is the agentic framework for building multi-step workflows, and AgentCore is the fully managed runtime that hosts and orchestrates these agents at scale.

Our agent is able to navigate through and analyze the issue, generating the right code and files. You can see it's making multiple tool calls. Then it's able to post a comment to say that the merge was complete. It can be plugged into an agentic framework and then effectively reason, plan, and deliver that specific use case. It's really critical for models to handle tools effectively so they can access external knowledge or take action. Nova has been optimized for tool use, offering higher reliability and accuracy.

It can take calls in a sequence or do parallel tool calls. Nova offers built-in tools such as a code interpreter, where you can use Python to do complex math calculations, as well as web grounding to access external information. We're orchestrating all of this on your behalf, so you just simply turn it on in the API and you're able to get these two additional tools available to the model and use it for your use case.
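The difference between sequential and parallel tool calls can be sketched on the application side. This is an illustrative example only: the tool functions, the `TOOLS` registry, and the call format are hypothetical, and the code does not show how Nova or any framework dispatches tools internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


# Hypothetical tools standing in for real integrations.
def get_order_status(order_id: str) -> str:
    return f"order {order_id}: shipped"


def get_refund_policy(region: str) -> str:
    return f"{region}: 30-day refunds"


TOOLS: Dict[str, Callable[..., str]] = {
    "get_order_status": get_order_status,
    "get_refund_policy": get_refund_policy,
}


def run_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one tool call and capture failures so the caller can course correct."""
    try:
        result = TOOLS[call["name"]](**call["args"])
        return {"name": call["name"], "ok": True, "result": result}
    except Exception as exc:  # unknown tool, bad arguments, or a tool error
        return {"name": call["name"], "ok": False, "error": repr(exc)}


def run_sequential(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run calls one after another, for steps that depend on earlier results."""
    return [run_tool_call(c) for c in calls]


def run_parallel(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run independent calls concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        return list(pool.map(run_tool_call, calls))


calls = [
    {"name": "get_order_status", "args": {"order_id": "A17"}},
    {"name": "get_refund_policy", "args": {"region": "EU"}},
    {"name": "cancel_order", "args": {"order_id": "A17"}},  # not registered
]
for outcome in run_parallel(calls):
    print(outcome)
```

Returning a structured failure instead of raising keeps the loop alive, so the failed call can be fed back to the model as context for its next step.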

Here's an example of a query that involves math. We also enable code interpreter in this query. You can see the user prompt and then the tool config, which is really simple to enable. The model chooses to use the tool and generates the Python code to perform the calculation and gets the correct result. Tools like this, especially in this context, are great to improve accuracy and really lift the overall performance. We know that LLMs are not great at math, so these kinds of additional tools really help improve performance.

Lastly, we have long context. This is critical, as I mentioned earlier, for things like document processing. If you have tens or hundreds of documents, each around 100 pages long, they can consume a large amount of context. The same goes for video. The model can take up to 90 minutes of video, which is a pretty long video file, and it is able to reason over it and understand it. We'll continue to innovate in this space and try to extend the context so that you can build these complex agentic workflows and leverage these capabilities.

Here's a great use case where long context becomes useful. Nova 2 is building a web app for a real estate agent using Claude. We give it a prompt to build the app, and you can see it's generating code and the multiple files needed to make the app functional. Once complete, we can then launch the app.

So you can see the created plan go through and generate code. Once complete, it will launch the app, and this is now completed, so it will run the command to actually launch the app. Then we should see the resulting webpage. This is the first iteration, and it has created a basic CRM system with property listings. Pretty good job on a first pass, but what is powerful is that you can keep iterating in multiple turns and use that long context to really get the outcome you are looking for.

Here are a couple of customers who have tried Nova 2 that I want to walk through. Siemens has been trying to use it to improve search, and they have said that Nova 2 Lite is up to three times faster than other models while still offering the high quality responses that their specific application needs. Again, the speed and cost efficiency really come into play here. Next we have Trellix, who has been using our Nova 1 family. They specifically were using Nova 1 Lite, and for them, Nova 2 no longer has any failures in tool calling. They have achieved a 39 percent improvement in threat classification and over three times more detailed responses with better technical analysis. It is really great to see how these models are moving from the first generation and how customers are continuing to get value and scale these things into production.

Real-World Evaluation Methodology: Moving Beyond Academic Benchmarks

I am going to hand it over to Ryan now, who will talk a bit more about real-world use cases.

Thanks for that. Can you guys hear me okay? Yeah, so I am Ryan. I lead a solutions architecture team within our AGI organization. I have been working on the Nova model family for the last two years. One of the things we learned, and I hear this a lot from customers, is that many of these model providers, us included, did pretty well on the benchmarks last year, but when people tried to use the models on the problems that mattered the most to them, the model did not always work, or it took a lot of extra work to make it work for those use cases.

We started this effort to really get a set of evals that was outside of the academic benchmarks that you see on the marketing slides and really modeled after the kinds of use cases that you are all building for your enterprises. I am going to talk a little bit about our mental model for that, and we will talk through an example here. I have talked through most of this, but we start by really working with our customers. A lot of these are internal to Amazon, but there are also external folks, maybe some here in the room. We work to understand that use case and what it means. What are you trying to accomplish with the model?

Then we build eval sets. We either get data from our customer or partner, we procure it, and then we run these things continuously every day, sometimes every night, once a week, or once a month, depending on the eval set. We run these to see how the models that we are building are performing, and then we use that to improve the data that we train with as well as the prompts that we use for the applications that we are building ourselves or helping our customers build.

We will dig in here and talk a little bit about the process. I have had the opportunity to talk to dozens or hundreds of customers over the last couple of years, and a lot of people do not necessarily know how to do this. We start by understanding what that business problem is. What are the dimensions that matter? What is the rubric that the subject matter expert that you are building that application with cares about? What are the dimensions over which the data will vary from the user base? Maybe there are four or five different categories within the prompts that are going into the system. Maybe there are ten or fifteen different tools that need to be used. All those things are really critical that you understand at the outset here.

Then we either collect prompt and response data, which we call traces, or, in some cases, use our understanding of the business problem to synthesize that data or procure it from a vendor. Then we run those evals, and we analyze the output of the model and code the failure modes. We will talk about this in a little bit more detail here on the next slide. Then we measure that by building metrics. As often as we can, we use programmatic metrics to do that. Everybody has heard about other techniques, like LLM as judge. Then as we get those metrics instrumented properly, we continue to refine the data, the metrics, and the prompts that we are using in order to run these evals.
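The trace, failure-mode, and metric steps described above could be represented roughly as follows. This is a minimal sketch, not the team's tooling: the `Trace` fields, the failure-mode labels, and the sample data are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """One prompt/response pair plus labels added by a reviewer during error analysis."""
    prompt: str
    response: str
    expected: str
    failure_modes: List[str] = field(default_factory=list)


def accuracy(traces: List[Trace]) -> float:
    """Simple programmatic metric: exact match against the expected answer."""
    if not traces:
        raise ValueError("need at least one trace")
    return sum(t.response == t.expected for t in traces) / len(traces)


def failure_mode_counts(traces: List[Trace]) -> Counter:
    """Bucket the reviewer's failure-mode labels to see which problems dominate."""
    return Counter(mode for t in traces for mode in t.failure_modes)


# Invented sample data; the labels are hypothetical failure-mode codes.
traces = [
    Trace("classify ticket 1", "billing", "billing"),
    Trace("classify ticket 2", "shipping", "billing", ["ignored_business_rule"]),
    Trace("classify ticket 3", "billing", "billing", ["answer_before_rationale"]),
    Trace("classify ticket 4", "other", "shipping",
          ["ignored_business_rule", "answer_before_rationale"]),
]

print(f"accuracy: {accuracy(traces):.2f}")
for mode, count in failure_mode_counts(traces).items():
    print(mode, count)
```

Note that a trace can be correct and still carry a failure-mode label (ticket 3 above), which is why looking only at right-versus-wrong misses part of the picture.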

We've applied this approach to many different use cases over the course of the last couple of years, but I'll focus on three here. One that we see every day is classification. Can I classify an email? Can I classify a customer service case that came in? Abby's going to talk about one in a few minutes that's important to their business. It's a super common use case. Every business that I've talked to has that.

The second one that we really focused on this past year is document understanding. Can we really understand what's going on inside of documents and specifically can we extract the content from that document in a structured way that is useful to your business processes? We'll talk about that. Then the last one is agentic workflows. Can we actually make agents that work for business processes that matter to you so that you can use humans for high judgment tasks instead of the tasks that are easy to automate?

In each one of those cases, we have anywhere from two to ten different customer evaluation sets that we are running on a continuous basis to make sure these models are starting to generalize and work outside of what you see on the benchmark slides. Let me talk a little bit about how we do the error analysis part. This is the part that I get the most questions about. We take a very simple methodology. Hopefully many of you are already doing this, but we try to apply it with very high standards.

Simply go through and read these prompts and these responses. Study them in detail. A lot of people just look at the end result and check whether they got the answer right or wrong or whether they agree with the thumbs up or thumbs down. You have to look deeper than that. Look at what happened. Compare the model output with what a subject matter expert would say is the right answer, and then notate what are the differences and why those differences are there.

Highlight what those errors are in detail. Take notes: long-form notes, little notes in the margins. What are the common themes in those notes about what the model got wrong and why it was wrong? Those will help you ultimately code these failure modes into buckets that will help you improve your prompts, improve your data if you're fine-tuning your models, and continuously improve your systems.

We do this for at least one hundred traces to make sure we have statistical significance. For some systems, it has to be more than that. If you have a really broad distribution of your underlying data sets, at least one hundred traces is our target. We try to get through that, bucket those failures, and you learn a tremendous amount. Your prompts will get better really quickly. If you're building custom models like we are, obviously we're building foundation models, it helps you really make sure you're generating data and choosing data mixes that will help you make the models better.
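To get a feel for what 100 traces buys you, here is a rough worked example. It assumes a normal-approximation 95% interval for an observed rate, which is my choice for illustration; the talk does not say which statistical method the team uses.

```python
import math


def margin_of_error(rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return z * math.sqrt(rate * (1.0 - rate) / n)


for n in (25, 100, 400):
    # Worst case is a 50% observed rate; the margin shrinks with the square root of n.
    print(n, round(margin_of_error(0.5, n), 3))
```

Under this approximation, 100 traces pin an observed 50% rate down to roughly plus or minus 10 points, and quadrupling the sample only halves that margin. That is consistent with the point that broader data distributions need more than 100 traces.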

Classification Use Case Analysis and Nova Forge Custom Model Building

Let me talk through an example. What you see here on the screen is a pretty simple prompt. It's a little bit contrived. I simplified it so it fit on the screen, but probably everybody in the room has done something like this. Can I take a product? Can I take a reference product? And can I determine if they're the same? There are five categories there. These business rules are described semantically in a way that a human could make this decision reasonably, and then I give a very specific output format.

I want a simple explanation for why the model is making the decision it's making, and then give an answer in JSON format. Everybody's done this. Every customer I've talked to has done this. Let's see how the model did. You can see right off the bat you get a response that sort of looks right, but it's not. Instead of giving the rationale first, it gave the answer. Sometimes these models are jumpy and they like to answer first.

You can see it got the wrong answer. It gave the classification first, and it's the wrong answer. It followed the order that I used in the prompt rather than the output format I asked for. That's a bad prompt on my part. I'll fix that. But then you can see it repeated that wrong answer within the JSON. Interestingly, the rationale is there. The rationale came after the answer. That's interesting. You can see the rationale actually makes it sound like the answer is correct. It says Brand Y has a different scent profile than Brand X, and so these products are incompatible.

A reasonable human wouldn't make that judgment based on the description on the screen, but the model backed into the answer in this case. It's surprising, but you do see this behavior from most models in certain situations. I learned a lot about how to improve the prompt in this particular case.

Taking that methodology a step forward, we coded these out and bucketed them into categories. We started to figure out how to measure these things. On that classification example I shared, we have about 8 or 10 test sets that are out of sample of what the science team uses. We built simple metrics, which are overly simplified here—the real ones we use are a little more complex—but we built very simple programmatic metrics. We had ground truth annotated by humans, but we built a bunch of additional simple metrics in order to measure the failure modes the model was exhibiting.
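Simple programmatic metrics for the classification failure modes described in this example might look like the sketch below. Everything specific here is an assumption for illustration: the five label names, the `{"answer": ...}` JSON shape, and the rule used to detect an answer given before the rationale are invented, not the metrics the team actually runs.

```python
import json
import re
from typing import Dict

# Hypothetical category names; the talk only says there are five categories.
ALLOWED_LABELS = {"exact_match", "variant", "substitute", "incompatible", "unrelated"}


def check_response(response: str, expected: str) -> Dict[str, bool]:
    """Score one response on format and correctness with cheap, deterministic checks."""
    text = response.strip()
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    parsed = None
    if match:
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            parsed = None
    label = parsed.get("answer") if isinstance(parsed, dict) else None
    # Text before the JSON should be the rationale, not a bare label.
    preamble = text[: match.start()].strip().lower() if match else text.lower()
    answered_first = (not preamble) or any(preamble.startswith(l) for l in ALLOWED_LABELS)
    return {
        "valid_json": isinstance(parsed, dict),
        "label_allowed": isinstance(label, str) and label in ALLOWED_LABELS,
        "rationale_first": not answered_first,
        "correct": label == expected,
    }


good = 'The scents differ but the product line and size match.\n{"answer": "variant"}'
bad = ('incompatible\n{"answer": "incompatible"}\n'
       "Brand Y has a different scent profile than Brand X.")

print(check_response(good, "variant"))
print(check_response(bad, "variant"))
```

Splitting the score into separate booleans keeps each failure mode countable on its own, so a response that is well formatted but wrong is not lumped together with one that answered before reasoning.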

The result is really awesome. What you see here on the screen is those three use cases I described. The blue bar is Nova 1 Pro, our best model at the time. The middle bar is Nova 2 Lite, our mid-tier model now. We want our mid-tier models to beat our best model quite a bit in these three cases. The pink bar on the right is Nova 2 Pro, which is in a gated preview mode right now.

We're really excited that this methodology works. These are all out of sample and none of these are in any of the public benchmarks that you read about or hear about. We're excited about the fact that we're starting to get hopefully more generalized results for the types of use cases that you need. We recognize that a lot of times you need to make the models really good at tasks that are important to your business.

Earlier today we announced Nova Forge, which is really the ability to build custom models in a very tailored way for your organization. There's a talk tomorrow where my colleague Karen is speaking with a customer on Nova Forge. You should go check out that talk—it's around noon tomorrow. It's a great set of tools that will let you build very specific models, either foundation models for your organization or very task-specific models that are really tuned for the types of tasks you're doing.

It has all the techniques: data mixing, reinforcement learning, fine tuning, SFT, and continued pre-training—everything you need in order to get started building custom models, hopefully without all of the undifferentiated heavy lifting you have to do if you do it yourself. The results are also good. We took that classification use case that I shared in the previous slide. Nova 2 Lite was 7 percent absolute improvement over baseline. When we actually went and fine-tuned that using Nova Forge with annotated reasoning traces, we were able to get a 21 percent lift over baseline.

It's really compelling results, and particularly in these classification use cases, as you'll hear from Abby in a second, can have a huge business impact if you have higher precision in some of these classification use cases. That's all I had. I'm going to pass it off to Abby, and I'd love to take questions after the talk from any of you that may have them.

Zendesk's AI Transformation: Translation Services Powered by Amazon Nova

Thanks, Ryan. My name is Abby. I head up the AI platform group over at Zendesk. I run engineering teams which build out the core AI platform, run evals, and build customer-centric services. I hope all of you have been enjoying re:Invent so far and have really enjoyed the Nova launches. I definitely have.

For people who don't know Zendesk, our mission is core and simple: to make customer service radically better by building intuitive products powered by AI. Today, I'll walk you through what we're doing at Zendesk, how we're partnering with AWS, how we're partnering with Nova models, and how we're making it work for our use cases. I'll also give you a sneak peek about the Nova 2 results. Spoiler alert: they look great, and I'll deep dive into them in a bit.

Before we start diving directly into the use cases, I want to talk about what we're doing at Zendesk around the AI story. We're really reimagining customer service at its core: not as tickets anymore, not as users, but really as a journey towards resolution, and that's what you see here, which is the Zendesk resolution platform.

If you look at the circle on the right at the centerpiece, you see agents. These are a network of AI-powered agents that can adapt, reason, and actually solve real customer interactions. They are not chatbots anymore. They can go end-to-end and solve complex customer queries. They can also connect and escalate to humans as needed, so you're efficiently solving tickets and queries without compromising on quality.

If you look at the next part, which is the Knowledge Graph, a knowledge graph is basically a unified self-service knowledge base that connects all knowledge bases, past tickets, and external systems. These agents have context that they can refer to, and that basically forms the foundation of how agents can solve problems. On the next piece, you have actions and integrations, which are a core part of why agents are able to solve user queries. Actions and integrations basically allow you to call tickets, call external systems, and call APIs, which gives that agentic capability to your AI.

Last but not least is measurement, insights, governance, and control, which gives Zendesk admins insights into how AI is solving queries, how it's reasoning, how it solved a query, why it didn't, and why it escalated to humans. This really gives you the metrics of what's happening behind the scenes. Zendesk had a really great year of growth, and it's no surprise to anyone at Zendesk that AI is our fastest growing product in the history of Zendesk. By the end of this month, we'll have 20,000 plus B2B customers who are going to be using our Zendesk AI platform.

That's more than 200 million in ARR, going from zero revenue in AI two years ago to 200 million plus. That's where AI is transforming the core of our resolution journey today. As you can see, 60 to 80 percent of our end-user interactions are now being solved through AI, not just touched or interacted with, but actually being solved through AI. Human agents can now focus on the most complex queries and can give that human touch to customers who actually need it.

What you see on the screens right now is what forms the foundation of our resolution platform. We have the AI agents that I just talked about, the co-pilot which allows for human assistance, auto QA so that we are checking for quality both on human interactions and bot interactions, the Action Builder which allows you to build apps on top of Zendesk, knowledge bases, and the AI Insight Hub. Overall, that basically forms the foundation of our resolution platform. So enough about what Zendesk is. Let's dive into what everyone is here for, which is the use cases and how Nova is solving them.

The first one that I'm going to talk about is translations. As you can imagine, service is global, and language shouldn't really be a barrier to solving these queries. With Zendesk AI, every human agent becomes a multilingual agent. How we do that is basically by supporting bidirectional translation for every ticket and every conversation that is coming in. What that actually means for our customers is they can have natively supported translation within the product without having to go outside, which helps them improve CSAT and solve queries faster.

It also means that they can expand support coverage across multiple languages without having to hire language specialists. The next one that I'm going to talk about is automated resolutions. Given how many times I've talked about resolutions today, you can tell automated resolutions are critical for us. The reason is that this is how we think about how many resolutions our customers are achieving. A successful resolution means an issue is solved without a human in the loop at all: an end-to-end interaction that gets solved through AI. Amazon Nova 1 is already having a massive impact for us in translation.

Today, we are translating over 300 million words per year across our customer base. That's 300 million more moments where language is not a barrier to customer support. We support more than 30 languages across 3 channels: email, web form, and API. This is how Nova 1 is already having a massive impact.

Before we go into Nova 2, let me show you what this looks like. If you look at the screen, the user is messaging in Chinese, but that automatically gets converted into English for the human agent. This allows the human agent to reply in English and enables bidirectional translation.

Now, on to what everyone is more excited about, which is Nova 2. We are seeing three core problems that Nova 2 is already addressing. We had three issues with Nova 1. The first was hallucination, where the model generated extra pieces of information while translating. The second was language mixing, where the output text still contained some part of the original language. The third was mistranslation, where the model did not know how to translate, so the output was exactly the same as the input. On all of those, we are seeing a massive impact. We have seen 5% lower hallucination rates, 2% less language mixing, and 3% less mistranslation. Even though these numbers seem small, on a volume of 300 million words, that is a massive impact.
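Two of the three failure modes described above have simple surface signals, which can be sketched as follows. This is only an illustrative sketch, not Zendesk's evaluation method: the function names are hypothetical, the language-mixing check assumes a Chinese-to-English translation like the one in the demo, and hallucination is left out because it cannot be detected by a string check alone.

```python
import unicodedata


def is_untranslated(source: str, output: str) -> bool:
    """Mistranslation as described in the talk: the output is identical to the input."""
    return source.strip() == output.strip()


def contains_cjk(text: str) -> bool:
    """Rough language-mixing signal for Chinese-to-English output:
    any CJK ideograph that survives in the translated text."""
    return any("CJK UNIFIED IDEOGRAPH" in unicodedata.name(ch, "") for ch in text)


def flag_translation(source: str, output: str) -> list[str]:
    """Return hypothetical issue labels for one Chinese-to-English translation."""
    flags = []
    if is_untranslated(source, output):
        flags.append("mistranslation")
    elif contains_cjk(output):
        flags.append("language_mixing")
    return flags


print(flag_translation("你好，世界", "Hello, 世界"))
```

A production check would need per-language script detection and a separate judge for hallucinated content; the sketch only shows how the two definitions translate into testable conditions.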

Beyond that, it also helps us scale from 30 to 60 plus languages. We are already seeing Nova 2 improve both the coverage of languages and how well it performs on them. That allows us to not only support more languages but also solve a massive customer problem around 5,500 million words across 5 channels. This leads to improved customer support and customer experience for our customers and their end users.

Automated Resolution at Scale and Closing Remarks on Nova 2's Impact

Talking about the next use case, which is automated resolution, this is very key to us at Zendesk. Automated resolution is where an AI agent can solve an end-to-end user query without human intervention.

How do we think about automated resolution and how does AI come into the picture? When a customer comes in and has a conversation with a chatbot, it can happen within a matter of seconds or it can happen in a matter of days. Once that is finished, we wait for 48 to 72 hours to make sure the conversation has ended. After that, we have a bunch of product metrics that we can see to determine what happened. Based on that, we know whether it is an automated resolution or not. Because automated resolution is so core to our resolutions and so core to our trust with our customers, we send that conversation back to an AI model to make sure it can capture all the nuances of language. That is where the LLMs and AI come in.
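The flow described above (wait for the conversation to go quiet, look at product metrics, then ask a model to confirm) can be sketched roughly as below. Everything here is an assumption made for illustration: the 72-hour constant picks the upper end of the stated window, `escalated_to_human` stands in for the unspecified product metrics, and `llm_confirms_resolved` stands in for the model's verdict rather than calling any real API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper end of the 48 to 72 hour window from the talk.
QUIET_PERIOD = timedelta(hours=72)


def ready_for_evaluation(last_message_at: datetime, now: datetime) -> bool:
    """A conversation is evaluated only after it has been quiet long enough."""
    return now - last_message_at >= QUIET_PERIOD


def classify(last_message_at: datetime, now: datetime,
             escalated_to_human: bool, llm_confirms_resolved: bool) -> str:
    """Hypothetical decision: pending, automated_resolution, or not_automated."""
    if not ready_for_evaluation(last_message_at, now):
        return "pending"
    if escalated_to_human:
        # Stand-in for the product metrics step: a human was involved.
        return "not_automated"
    # Final step: the model's reading of the conversation decides.
    return "automated_resolution" if llm_confirms_resolved else "not_automated"


now = datetime(2025, 12, 5, tzinfo=timezone.utc)
print(classify(now - timedelta(hours=80), now, False, True))
```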

Today at Zendesk, we have already delivered more than 5 billion resolutions this year. That is a massive number, and it is going to keep increasing. At Zendesk, we believe that within the next couple of years, 80 plus percent of interactions are going to be solved through AI end to end, without any human intervention. That is what we are aiming to achieve.

Turning to the Nova 2 key results, there are three core parts to this as well: false positives, false negatives, and JSON formatting. A false positive means the AI thought it was a resolution, but it was not. That impacts customer trust massively. A false negative is where the AI thought it was not a resolution, but it was. That means we are missing out on resolution potential. The third problem that we saw with Nova 1 was the JSON format of the output. A lot of customers have provided that feedback. Looking at all three with Nova 2, it is looking pretty good across the board right now.
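The three measures above can be made concrete with a small sketch. It assumes, purely for illustration, that the model returns a JSON object with a boolean `resolved` field and that each conversation has a human-provided ground-truth label; neither the schema nor the function names come from the talk.

```python
import json


def parse_verdict(raw: str):
    """Return True or False for a well-formed verdict, or None if the
    model output is not usable JSON in the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    resolved = data.get("resolved")
    return resolved if isinstance(resolved, bool) else None


def error_rates(pairs):
    """pairs: list of (raw_model_output, human_label) tuples.
    Returns the share of false positives, false negatives, and malformed outputs."""
    total = len(pairs)
    if total == 0:
        raise ValueError("need at least one labeled conversation")
    fp = fn = malformed = 0
    for raw, truth in pairs:
        verdict = parse_verdict(raw)
        if verdict is None:
            malformed += 1      # JSON format problem
        elif verdict and not truth:
            fp += 1             # model said resolved, but it was not
        elif not verdict and truth:
            fn += 1             # model said not resolved, but it was
    return {
        "false_positive": fp / total,
        "false_negative": fn / total,
        "malformed": malformed / total,
    }


sample = [
    ('{"resolved": true}', True),
    ('{"resolved": true}', False),
    ('{"resolved": false}', True),
    ("not json", True),
]
print(error_rates(sample))
```

Counting malformed outputs separately keeps formatting failures from being silently folded into either error type.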

We use Nova for this use case, but looking at Nova 2, it's looking super promising. These are early results, and we're going to do a deep eval on this, but so far it's looking really good. What that allows us to do is build customer trust, expand our potential, and also reduce our cost to serve.

Thanks, Abhi. And thanks, Ryan. It's really refreshing to hear from Abhi how Nova has improved the experiences their customers are having. So just to recap, today we announced our Nova 2 foundation models across these different modalities and capabilities, spanning reasoning and conversational AI. Ryan mentioned NovaForge, our customization offering, which allows you to go deeper into model training to really get the most out of foundation models.

We also announced Nova Act, which is a browser-use agent, and there are a number of agent capabilities that we're delivering. A lot of these talks are happening tomorrow, so please watch out for any Nova talks. We've got a lot of talks around how to build effective agents and NovaForge customization, so please take a look if you're interested.

You can navigate to nova.amazon.com. Earlier I showed you a demo of this site, and you can actually go there now and play around with some of these models. You can also go to nova.amazon.com/dev, where we offer interoperable APIs so you can experiment and play with some of these models today. Chat with Nova, build with Nova: it's all in your hands. Give us your feedback. We're really excited for you to try them out.

Thank you so much for your time. We really appreciate it. If you have any questions for any one of us, we'll be around for a few minutes, so please feel free to come to us and ask us a few questions. Thank you for your time.

; This article is entirely auto-generated using Amazon Bedrock.