Kazuya

AWS re:Invent 2025 - Own Your AI – Blazing Fast OSS AI on AWS (STP104)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Own Your AI – Blazing Fast OSS AI on AWS (STP104)

In this video, Fireworks AI presents their open-source inference and customization platform for building production agents across industries. The speaker explains challenges in agent development including model selection, latency, quality, and cost, then demonstrates how Fireworks addresses these through their FireOptimizer technology with 84,000 deployment parameters, custom CUDA kernels, and fine-tuning capabilities. Key features include day-one access to models like Deepseek and Llama, speculative decoding achieving 70%+ acceptance rates for sub-100ms latency, and reinforcement fine-tuning showing 20% quality improvements. Built on AWS infrastructure with deployment options from SaaS to air-gapped environments, the platform supports clients like Notion (100M+ users, 4x lower latency) and DoorDash (3x faster VLM processing, 10% cost reduction).


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 30

The Challenge of Building Production-Ready AI Agents

All right, looks like a crowd. Wow, nice, I love the energy from the three people that are probably done with their day and have to deal with one more person talking about AI. All right, so I'll try to make it fun for the three people that are looking at me. I will not be the first person you've heard this from, and I will most definitely not be the last person you've heard this from, but 2025 was the year of agents, and I would say probably the next decade will be the year of agents.

At Fireworks AI, we've helped a ton of companies across all different use cases to build agents in production. This includes coding agents with a lot of the companies when you walk around here using us, document agents that process banking information and legal information, sales and marketing agents that are doing outbound and inbound, hiring agents that are doing recruiting automatically, and customer service agents that are increasing CSAT and reducing average handling time. You name it, across all of the industries that you can imagine: retail, insurance, finance, education, life science, security, and more.

Thumbnail 90

You're probably saying, okay, well everybody's saying they build agents, but the reality is a lot of people don't talk about how hard it is to build agents. It sounds really sexy, agents are really cool, and everybody loves talking about agents. What people don't really talk about is how difficult it can be to build agents for specific business use cases. You have all of these possible errors that can happen and possible failure modes that can happen across multiple layers of the stack.

Just starting here, one is: are you going to use a closed-source model or an open-source model? If you're using an open-source model, are you using a small model, maybe an 8B or 5B model, or a large model like Llama or Deepseek that runs into the trillions of parameters? Once you've made your selection and you have your model, let's say it's a search use case, what about latency? You try to deploy it and you want your search to come back in under one second, which means you need your LLM to respond in under 300 milliseconds, and keeping that up at scale when you have millions of users is really hard.

Then of course, quality is king. You need to ensure that the accuracy and the quality is the highest it can be. Once you have all of that set out, let's say you use the closed-source provider, you start to launch and then you see your costs are absolutely ballooning. So you kind of got across the first three, but now you can't really support it because it's too expensive. And then of course, if you want to do open-source models, there's a lot of infrastructure complexity involved. Are you running an EKS cluster, an ECS cluster? How are you deploying across multiple nodes with tens of GPUs or dozens of GPUs?

Of course, there's data privacy, compliance, security, and availability. Do you have the ML expertise? I could go on and on. You're probably getting bored because there are just so many of these errors. Now, what I want to convince you about today is that at Fireworks, we are basically the one-stop shop, the AI platform inference and customization engine that lets you forget about all of these and really build what we call magical AI experiences.

Thumbnail 230

Fireworks Platform: Open-Source Inference, Workload Optimization, and Fine-Tuning

What is the Fireworks platform? As I said, we are an open-source inference and customization engine for open-source models. At the very top layer, we make it really easy for developers to use any of the top open-source models with day one access. This includes Deepseek models—actually Deepseek released the model today and it's going to be available on Fireworks today—Llama models, Mistral, Qwen, you name it. Not only LLMs, but also vision language models and voice and ASR models. It's incredibly easy to get started, so we are compatible with the OpenAI SDK Python library. If you're using OpenAI, it's literally two lines of code that you have to switch: one is the model selection, and two, you just need an API key for Fireworks.
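To make that switch concrete, here is a minimal sketch using the OpenAI Python SDK pointed at Fireworks' OpenAI-compatible endpoint. The base URL and model slug below are illustrative examples and were not given in the talk, so check the current Fireworks documentation for exact values.

```python
# A minimal sketch of the "two lines of code" switch described above, assuming
# Fireworks' OpenAI-compatible API. The base URL and model slug are illustrative;
# check the Fireworks docs and console for the exact values and your API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # point the SDK at Fireworks instead of OpenAI
    api_key="YOUR_FIREWORKS_API_KEY",                   # a Fireworks API key replaces the OpenAI key
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # swap in an open-source model hosted on Fireworks
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```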

Once developers find the right model they want to run, let's say they're doing a search use case with a really fast 8B or smaller model, what very likely happens when they go to deploy it is that they want the quality to not only match closed-source providers but beat them. They want that latency of, let's say, 300 milliseconds to stay stable as they scale to millions and millions of users. That's where our customization engine, FireOptimizer, comes in.

We'll cover this in detail in the next slide, but we do two types of customization. One is workload optimization, which means if it's a search use case, there are very different deployment options than if you were to do an agentic AI use case, like some of our partners in the area. The second one is fine-tuning models. I have a saying that generic closed-source models are built for everyone but optimized for no one. They're great—they're one inch deep and miles wide, and they can do a ton of different things. But if you really push them on a very specific use case, they might have a lot of failure modes.

At Fireworks, we basically have our clients build models with their data and then we optimize for them. One of our co-founders used a phrase that I really like: you don't need a bazooka to kill a mosquito. That's exactly what Fireworks provides. Instead of a bazooka, you need something very specific to kill that mosquito and actually hit it. That's what our fine-tuning and reinforcement learning fine-tuning are all about—to really push that quality above what closed-source providers offer. Finally, of course, there's the question of how you scale. When you're scaling to millions of users or tens of millions of users, you want the platform to be super reliable. You want the SLAs to be top-notch and to have the highest reliability possible.

Thumbnail 410

We're able to do this because of our phenomenal partners, AWS as well as Nvidia and AMD. We have a virtual cloud infrastructure with dozens of regions across the world where all of these companies run their open-source models. So that's a high-level overview of our platform. Now, for the few people listening who are actually interested in the nitty-gritty details, I'm going to jump into it. As I talked about, we customize across two different levers. One is workload optimization. I don't know how many of you took ML 101, but a show of hands? No one? Okay, a couple of you. You remember grid search and hyperparameter tuning, right?

We essentially computed the space you have for deployment across quality, speed, and cost dimensions. This includes model specifications, the hardware you're using, different execution modes, kernel options, and more. It comes to around 84,000 parameters—a very large space that no one wants to run through manually. What we've developed is our proprietary technology called FireOptimizer. Essentially, clients work with us and give us what we call their workload profile. They'll tell us, for example, "I have a search use case. I want the latency to be below 200 milliseconds, and I'm going to scale to a throughput of, let's say, 200 queries per second." From that, we run it through our stack and identify the optimal setup for them to run their open-source models.
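Fireworks has not published how FireOptimizer searches this space, but conceptually it resembles a constrained search over deployment configurations, as in the toy sketch below. All of the parameter names, candidate values, and latency/cost estimators here are hypothetical stand-ins used only to illustrate picking the cheapest configuration that still meets a workload's latency target.

```python
# Illustrative only: a toy grid search over deployment configurations, in the spirit of
# the FireOptimizer description above. The knobs, values, and estimators are hypothetical.
from itertools import product

SEARCH_SPACE = {
    "hardware": ["H100", "A100", "MI300X"],
    "tensor_parallel": [1, 2, 4, 8],
    "quantization": ["fp16", "fp8"],
    "speculative_decoding": [False, True],
}

def estimate(config):
    """Hypothetical estimator returning (p50_latency_ms, relative_cost)."""
    latency = 400 / config["tensor_parallel"] * (0.8 if config["quantization"] == "fp8" else 1.0)
    if config["speculative_decoding"]:
        latency *= 0.6  # assumes a high draft-token acceptance rate
    cost = 0.5 * config["tensor_parallel"] * (1.2 if config["hardware"] == "H100" else 1.0)
    return latency, cost

def optimize(max_latency_ms=200.0):
    """Return the cheapest configuration that meets the workload's latency target."""
    best = None
    for values in product(*SEARCH_SPACE.values()):
        config = dict(zip(SEARCH_SPACE, values))
        latency, cost = estimate(config)
        if latency <= max_latency_ms and (best is None or cost < best[1]):
            best = (config, cost, latency)
    return best

print(optimize(max_latency_ms=200.0))
```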

The reason we've been able to support all these different use cases—from very low latency to super complex agentic AI workloads with huge parameters—is that we've built our stack from the ground up. We've optimized every single layer in our stack, starting at the lowest layer, which is the actual CUDA kernels running in the GPUs. We have custom CUDA kernels called FireAttention. As you move up the stack, there's the question of how you deploy across multiple nodes. For example, if you're deploying a model like Llama, which is a 1 trillion parameter model, you might need 16 or 24 GPUs. How you deploy that is not trivial at all.

We have disaggregated inference, which means separating the prefill from the generation. We make it incredibly easy for clients to forget about all this complexity; we manage it all for them. Then at the very top, the model layer, we use methods that are used across the industry, but we use them way better. One such method is speculative decoding. For those who don't know what speculative decoding is, essentially you pair a very large model with a very small model that actually generates the tokens, and the large model then accepts or rejects those tokens. If the acceptance rate is high, your latency comes down a lot. If the acceptance rate is low, it's actually counterproductive and the latency gets worse.
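As a rough illustration of that accept/reject loop, here is a simplified sketch of speculative decoding with greedy verification. The draft_model and target_model objects are assumed interfaces rather than a real API, and production systems use a probabilistic accept/reject rule instead of exact token matching.

```python
# Simplified sketch of speculative decoding with greedy verification.
# Assumed interfaces (placeholders, not a real library):
#   draft_model.propose(tokens, k)       -> list of k cheaply drafted next tokens
#   target_model.verify(tokens, draft)   -> the large model's own next-token choice
#                                           at each drafted position, in one pass
def speculative_decode(draft_model, target_model, prompt_tokens, k=4, max_new_tokens=64):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        draft = draft_model.propose(tokens, k)          # small model drafts k tokens cheaply
        verified = target_model.verify(tokens, draft)   # one large-model pass checks all k

        for proposed, actual in zip(draft, verified):
            if proposed == actual:
                tokens.append(proposed)   # accepted: the large model agrees, latency win
            else:
                tokens.append(actual)     # rejected: keep the large model's token
                break                     # and discard the rest of the draft
    return tokens
```

The higher the acceptance rate, the more tokens get committed per large-model pass, which is why pushing acceptance above 70 percent translates directly into lower latency.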

Thumbnail 600

We train our own draft or speculative decoder models to push the acceptance rates way above 70 percent, which makes it so you can get latencies below 100 milliseconds for certain use cases. That is the workload optimization part—all about latency, cost, and scalability. The other part is really around fine-tuning. I use the metaphor, perhaps not perfectly, but it's useful: you don't need a bazooka to kill a mosquito.

At Fireworks, we really think that a company's moat is in owning their own AI. The model is their IP, the model is their product, and the way they get there is by using their subject matter expertise and their data to fine-tune models. We have a fine-tuning platform that makes it really easy for developers to use supervised fine-tuning for intent classification, sentiment analysis, or product catalog cleansing, which is essentially when you have images and you're tagging them, whether it's gender, shoes, or different categories like jeans, jackets, shirts, and so on.
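To make the supervised fine-tuning idea concrete, here is an illustrative example of what a chat-style training file for an intent-classification agent might look like. The exact schema Fireworks expects may differ; this only shows the prompt/response pairing described above.

```python
# Illustrative only: chat-style JSONL records pairing a user prompt with the
# labeled response you want the fine-tuned model to produce. The schema is an
# assumption for illustration, not Fireworks' documented format.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Where's my order? It was supposed to arrive yesterday."},
        {"role": "assistant", "content": "intent: order_status"},
    ]},
    {"messages": [
        {"role": "user", "content": "These jeans came in the wrong size, I want my money back."},
        {"role": "assistant", "content": "intent: refund_request"},
    ]},
]

with open("train.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")
```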

We also just recently released reinforcement fine-tuning. Instead of building a dataset of prompts paired with the responses you want the model to produce, you write what is called an evaluator function that scores a model's response between 0 and 1, and the model gets better as it learns through its environment. We've seen incredible success with reinforcement fine-tuning, both on coding agents and in a very recent client success story with our partners from Genpark, where they used RFT to move from one of the closed-source models to an open-source Fireworks model. They saw around a 20% increase in quality while reducing latency and costs.
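As a concrete illustration of such an evaluator, here is a minimal sketch for a text-to-SQL task that scores a model response between 0 and 1. The execute_sql helper is hypothetical and this is not Fireworks' actual evaluator API, just the shape of the idea.

```python
# Minimal sketch of an RFT evaluator: score a generated SQL query between 0 and 1.
# execute_sql is a hypothetical helper that runs the query against a test database.
def evaluate(prompt: str, response: str, expected_rows: list) -> float:
    """Full reward for a query that runs and matches the reference result,
    partial credit if it at least executes, zero otherwise."""
    try:
        rows = execute_sql(response)   # hypothetical helper, not a real API
    except Exception:
        return 0.0                     # invalid or failing SQL earns no reward
    if rows == expected_rows:
        return 1.0                     # exact result match: full reward
    return 0.3                         # runs but gives the wrong answer: small partial credit
```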

Essentially, we want to make it really simple for companies to put their expertise into this data flywheel so that they can fine-tune really quickly, deploy really quickly, and whenever there's a new model drop, swap in the new model, fine-tune it again, and deploy again. For everybody who took ML 101, there's covariate drift and data drift, and things change, so we make it really fast for companies to iterate through the data flywheel. We have a couple of examples here. With VLM fine-tuning, we take an open-source model, a Qwen model, fine-tune it on the specific dataset, and you see that we don't just match the closed-source provider, we beat it. And the same with RFT, reinforcement fine-tuning, for a text-to-SQL use case: again, we not only match but beat the closed-source provider.

Thumbnail 760

AWS Integration and Client Success Stories Across Industries

Really, it's all about owning your AI. The model is your product, the model is your IP, and a company's real moat is in their own data, their own expertise, and how they put that into their AI models for their specific use cases. Now, why am I talking about this at AWS re:Invent? Because we've built our entire stack on top of AWS. Essentially, we use EC2, ECS, and EKS, and we're just an inference and optimization layer that runs right on top of them. FireOptimizer, the fine-tuning I described, and the incredibly fast inference engine are just a software layer that runs on top of AWS.

Thumbnail 790

Thumbnail 840

Now, we work with everyone from Gen AI startups that move incredibly fast and use our SaaS platform, all the way to legacy companies that have to be incredibly compliant and secure about data privacy. That's why we've developed these different deployment options on top of AWS. At one end is our SaaS platform, which is the least private; even though we have zero data retention, it all runs in our environment. At the other end is fully air-gapped: if you want to deploy Fireworks in a fully air-gapped environment, meaning absolutely nothing leaves your VPC, we can do that by deploying through a Kubernetes cluster or through SageMaker. And there's everything in between: we can deploy with AWS PrivateLink, where the network is secure, or through non-air-gapped BYOC, where we basically just run what we call a control plane.

Thumbnail 850

Now, in the last five minutes I have, I'm going to talk about some client success stories so that you actually believe all the things I'm saying. One example, and of course I'm going to start with one whose logo I can't use, is a very large grocery delivery platform. They use us to fine-tune a very small Llama 8B model for a search use case. These are the kinds of use cases that require incredibly low latency. They handle around two to three million of what they call ambiguous queries every day.

So let's say that you're using Uber Eats or Instacart or DoorDash or whatever it is. Sometimes users will put something that is very vague, and you want to make it so that an LLM, a very fast LLM, can do things like query expansion. It'll rewrite it, maybe it'll tag it, and so on. They use Fireworks to fine-tune that model, reduce the search support tickets by 50%, and run everything at 300 milliseconds or less.
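As a rough sketch of that query-expansion step, the snippet below asks a small, fast model to rewrite and tag an ambiguous grocery query through the OpenAI-compatible client shown earlier. The model name, prompt, and settings are illustrative, not the platform's actual configuration.

```python
# Illustrative sketch of LLM query expansion for ambiguous search queries.
# The model slug, prompt, and settings are examples, not a real deployment.
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="YOUR_FIREWORKS_API_KEY")

def expand_query(raw_query: str) -> str:
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # a small model keeps latency low
        max_tokens=64,        # short outputs help stay within a tight latency budget
        temperature=0.0,      # deterministic rewrites are easier to cache and evaluate
        messages=[
            {"role": "system", "content": "Rewrite the grocery search query into three specific queries and tag the product category."},
            {"role": "user", "content": raw_query},
        ],
    )
    return resp.choices[0].message.content

print(expand_query("something healthy for breakfast"))
```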

Thumbnail 920

That's not only a technical target; they actually ran an A/B test and saw a measurable uplift once they moved to Fireworks. Another example is Notion. They're a great partner of ours. I'm a huge Notion user myself, a lot of good stuff in there. If you love Notion AI, then you'll love Fireworks AI, because a lot of Notion AI runs on Fireworks AI. They've also made use of fine-tuning with very small models for very fast inference. They moved from a closed-source provider to us and saw around 4x lower latency, down to 500 milliseconds and below, while scaling to 100 million plus users. So again, incredibly low latency at scale while increasing quality through fine-tuning.

Thumbnail 960

And the last one I'll talk about is DoorDash. I talked a little bit about VLMs, vision language models. We don't only have text models; we also have VLMs, and there's a lot of growing interest from clients in these use cases. One very interesting one is product catalog cleansing. These companies have millions and millions of images they want to label as jeans, shorts, pants, and so on, automatically and with a high degree of fidelity and accuracy. For example, DoorDash fine-tuned an open-source model that runs 3x faster than the closed-source model they were using before, while increasing quality through fine-tuning and reducing cost by 10 percent.

Thumbnail 1020

So all in all, we provide a one-stop shop for everything you need to build what we call magical AI applications, to really match and beat closed-source provider quality while keeping latency incredibly low and cost under control. Thank you very much. If you have any questions, we're somewhere over there. Don't ask me our booth number, because I have terrible working memory, but you will find us somewhere around there, and if not, I'll be walking around here and can point you to it. I'm happy to answer any questions, and hopefully I'll meet the ten people who listened to me and hopefully I wasn't super boring. So anyway, thank you so much.


This article is entirely auto-generated using Amazon Bedrock.
