🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Scaling Observability for the AI Era: From GPUs to LLMs (AIM121)
In this video, Ryan from Chronosphere explores AI observability across three key use cases: model training, inference hosting, and AI-native applications. He discusses challenges like GPU utilization, token economics, and model accuracy issues including hallucinations and bias. The presentation demonstrates how Chronosphere uses open-source tools like OpenTelemetry, NVIDIA DCGM, and Open Inference SDK to monitor training efficiency, detect anomalies, and track LLM-specific metrics such as token consumption and response quality across distributed AI systems.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to AI Observability: Market Landscape and Emerging Challenges
Welcome everybody. Thanks for attending our talk. I'm Ryan, a sales engineer at Chronosphere. Chronosphere is an observability company focused on open source data collection, performance and reliability at scale, and control over your telemetry, only paying for what you need. Recently we've had a lot of success at AI companies and in AI use cases.
In this short talk, we're going to cover an introduction to AI observability, the different patterns and use cases we're seeing, and how to use observability to prevent some pitfalls. We won't be able to go super deep into product demonstration, but I definitely encourage you to check out the Chronosphere booth, where we're demoing our AI guided capabilities and some of our other new features.
First, before we get into the use cases themselves, let me overview what we're seeing from a market perspective. We've broken the market down into four core buckets. We're seeing model builders, who are building the foundation models that everyone else is building on top of and consuming. GPU providers are tailoring GPU infrastructure around AI inference, model training, and fine tuning use cases. AI natives are building products from the ground up around AI technology. Then we have feature builders, who have existing products and capabilities and are adding AI functionality into those existing product lines. Across the board, one thing to highlight is that observability has always been hard and continues to be a struggle, and AI just adds another layer of complexity on top of that.
All of your existing large scale cloud native problem patterns definitely still exist and are at the core of AI observability use cases. Going into a little bit more depth on those challenges, what we're seeing in existing large cloud native workloads includes massive scale, really mission critical reliability, high performance, a lot of troubleshooting complexity across distributed systems, observability costs and data volume control, and managing cardinality as your infrastructure changes.
Some of the new AI specific challenges we're talking about are around model behavior, making sure the model is accurate and doing what you expect it to do. We need to manage the token economics to actually get a return on investment on the use case you're attacking. We also need to understand complex dependencies, especially if you're using MCP, RAG, or agentic architectures. And then lastly, if you're managing your own GPU infrastructure, that's largely a new component for many organizations.
Model Training and Fine-Tuning: Maximizing GPU Utilization and Training Efficiency
We're going to dive into our first use case, which is model training. Everything in this use case also applies if you're doing fine tuning. What really matters here is training efficiency, model performance as the end result, and GPU utilization. These resources are extremely costly, so it's critical that you're actually getting the right utilization from your investment. Before we get into the observability side, let me click through a quick overview of the standard model development life cycle.
I'm definitely trivializing it a little bit here, but we're taking large data sets, putting that into a large compute infrastructure with GPU accelerators, and running distributed training jobs in that infrastructure with the goal of producing a trained model that we can then put out into the wild and get value from. Once the model is complete, the next step is actually hosting your inference service, whether that's externally or internally, more as a platform team.
From this point, as a user, I can say, here's my description or image of a cat, and my model can infer or predict, yes, this is indeed a cat. It's simple. If we look at a basic architecture of how you might go about this, it's all about scale, reliability, and performance.
What we're seeing in the market is that the more training cycles and the more compute you have, the bigger and better model you get. Efficient training becomes a competitive advantage, especially when everybody has access to roughly equal compute infrastructure. Looking at where the problem patterns start to occur and where we can start thinking about observability to prevent them, we start with our data sets.
Understanding that a small amount of inaccurate or invalid data can poison your entire training cycle is super critical. Understanding the metadata around your data sets and measuring how one data set versus another impacts the results you get is essential. Similarly, we have data ingestion services, and if these are slow or have spikes and errors, it's going to bottleneck your entire training pipeline. We have the model training jobs themselves, which is very similar to a traditional service or any other job you might be monitoring. We need to correlate infrastructure issues with the outcome of training. And then on the far right-hand side, we see the GPUs.
Those GPUs are drawn with a dollar sign on fire for a reason: if you have downtime or low utilization, you're not only wasting money, you're also slowing down your time to market and your time to value on the investment. So at the end of the day, you're asking yourself: are we maximizing our training efficiency to stay competitive in every way that we can?
Now jumping to Chronosphere and tying this a little bit more closely to observability, what we're looking at here is a Chronosphere lens service page. This is interesting to us because Chronosphere is detecting GPU metrics from the NVIDIA DCGM Prometheus exporter. We're getting utilization, temperature, and error stats. But we know from our labeling strategy that this is supporting a specific training job. We're also getting the training metrics from our Horovod Python SDKs, which give us training accuracy, gradient norms, and samples per second. Having all this information here lets us quickly understand end-to-end what's happening in our training job.
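To make the labeling idea concrete, here is a minimal sketch (not Chronosphere's actual configuration) of a training loop exposing its own metrics with a `training_job` label, so they can be joined with the GPU metrics the NVIDIA DCGM exporter publishes for the same job's pods. The metric names, the label name, and the `train_one_step` stub are assumptions for illustration.

```python
# Minimal sketch: publish training-loop metrics with a "training_job" label
# so they can be correlated with DCGM GPU metrics carrying the same label.
import random
import time
from prometheus_client import Counter, Gauge, start_http_server

TRAINING_JOB = "llm-pretrain-run-42"  # hypothetical job name

accuracy = Gauge("training_accuracy", "Current training accuracy", ["training_job"])
grad_norm = Gauge("training_gradient_norm", "Latest gradient norm", ["training_job"])
samples = Counter("training_samples", "Training samples processed", ["training_job"])

def train_one_step(step):
    """Stand-in for a real Horovod/PyTorch training step."""
    return min(0.99, step / 1000), random.uniform(0.1, 2.0), 256  # accuracy, grad norm, batch size

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus or an OTel collector scrapes this endpoint
    for step in range(1000):
        acc, norm, batch = train_one_step(step)
        accuracy.labels(training_job=TRAINING_JOB).set(acc)
        grad_norm.labels(training_job=TRAINING_JOB).set(norm)
        samples.labels(training_job=TRAINING_JOB).inc(batch)
        time.sleep(1)
```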
We're looking at this from the perspective of a human looking at dashboards and service pages, but the same value grouping and analysis applies to our AI troubleshooting tooling and our MCP and agentic integrations. That's super critical to think about. Throughout, low-latency alerting is essential. If you have XID errors and GPUs that are malfunctioning, the time from that malfunctioning GPU to getting an alert in front of an operator who can remediate is absolutely critical. And throughout, we're only keeping the data that we actually need to accomplish the use case we're pursuing.
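As a rough illustration of that alerting path (in production this would be an alerting rule in your observability platform rather than a script), here is a sketch that polls a Prometheus-compatible API for the DCGM exporter's XID error counter and notifies an operator when it increases. The endpoint URL and `notify_operator` function are placeholders, and you should verify the exact metric name against your dcgm-exporter version.

```python
# Sketch of a simple XID-error watchdog against a Prometheus-compatible query API.
import time
import requests

PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"  # placeholder endpoint
QUERY = "increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0"  # assumed dcgm-exporter metric name

def notify_operator(labels):
    """Placeholder for a PagerDuty/Slack/etc. integration."""
    print(f"GPU XID errors detected: {labels}")

def watch(poll_seconds=30):
    while True:
        resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        for series in resp.json().get("data", {}).get("result", []):
            notify_operator(series["metric"])  # labels identify the node/GPU at fault
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```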
This is a scroll down of the same service page. I'm highlighting this because we have distributed tracing with OTel, and we get out of the box this dependency map. We can see right away if there's a spike in errors or a slowdown with a data ingestion service. We have all of our telemetry in one place. So not only do we know there's an issue, but we have all the logs, the events, the metrics, and the traces to dig into and really correlate and identify the root cause, ultimately minimizing training downtime and maximizing your GPU utilization.
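For that dependency map to show up, the services just need standard OpenTelemetry instrumentation. Here is a minimal, hedged sketch of tracing a hypothetical data ingestion step with the OpenTelemetry Python SDK, exporting spans over OTLP to whatever collector or backend you run; the service name and attribute keys are illustrative.

```python
# Minimal OpenTelemetry tracing sketch for a hypothetical data ingestion service.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "data-ingestion"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # endpoint taken from env vars
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ingestion")

def ingest_batch(dataset_name, records):
    with tracer.start_as_current_span("ingest_batch") as span:
        span.set_attribute("dataset.name", dataset_name)          # illustrative attribute keys
        span.set_attribute("dataset.record_count", len(records))
        for record in records:
            with tracer.start_as_current_span("validate_record"):
                pass  # validation / transformation of one record would happen here

if __name__ == "__main__":
    ingest_batch("cat-images-v3", [{"id": i} for i in range(100)])
```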
Inference Hosting: Ensuring Reliability and Performance at Scale
So we have a trained or fine-tuned model. That's awesome, but it doesn't provide a ton of value unless we can put it in front of users, which brings us to the inference hosting use case. What matters here is service reliability. People are going to be building on top of this, and they need it to work, or they're going to pursue other alternatives. On the same note, it has to be fast. If it's not fast, they're waiting around and they're going to use the next tool. And ultimately, we need this to be scalable. We're investing all this time and energy into training and hosting, and we don't want it serving only small-scale use cases; we want it to scale to many users.
So here's another architecture diagram. This one will feel very familiar to a traditional cloud native service. We just kind of have inference plugged in at the back end there. But users need fast and accurate responses across multiple client devices. The services are relying on this, so uptime and performance are critical. Namely, this last bullet point: incidents and outages can be very high impact and high visibility when we're talking about inference. You don't want to be in the news because your AI is giving incorrect or harmful information.
So looking at our problem patterns, we have front-end issues in our different UIs, upstream dependencies, and all of these supporting services can impact our reliability. We have network issues, and then again, keeping GPUs kind of always in sight when we're talking about AI use cases. It's a little bit less critical for inference and might impact only a smaller set of users, but it's still ultimately important to keep track of.
Jumping back into Chronosphere, now we're in the perspective of a platform team self-hosting some inference. We still care about all of our RED metrics like we would with any other service: rate of requests, errors, and duration. But we also want some way to evaluate and benchmark the accuracy and health of the inference itself. That's what you see here with our hallucination rate, biased response rate, and toxic response rate. So again, we have all the telemetry in one place, and we're correlating these different things.
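Those quality signals have to be produced by something: an evaluation step scores each response and emits the result as telemetry next to the usual request metrics. A rough sketch using the OpenTelemetry metrics API, where `evaluate_response` is a stand-in for whatever evaluator you run (Phoenix, an LLM-as-judge, or your own rules) and the metric names are made up for the example:

```python
# Sketch: count inference requests and flagged responses so hallucination,
# bias, and toxicity rates can be graphed next to the usual RED metrics.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())])
)
meter = metrics.get_meter("inference-gateway")

requests_total = meter.create_counter("inference.requests", description="Inference requests")
flagged_total = meter.create_counter("inference.flagged_responses",
                                     description="Responses flagged by evaluations")

def evaluate_response(prompt, completion):
    """Stand-in evaluator; replace with Phoenix, an LLM judge, or custom rules."""
    return {"hallucination": False, "bias": False, "toxicity": False}

def record_inference(model, prompt, completion):
    requests_total.add(1, {"model": model})
    for check, failed in evaluate_response(prompt, completion).items():
        if failed:
            flagged_total.add(1, {"model": model, "check": check})
```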
One piece of positive feedback we get from our customers is that you can click into any graph in Chronosphere and access our anomaly detection feature, called differential diagnosis. For example, for that spike down there, you're able to quickly identify which label is most uniquely associated with the anomaly. Is it a build version, a cluster, a container, or something else? That's the actionable piece of information that often gets lost in the noise of a large observability implementation.
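Differential diagnosis itself is a Chronosphere feature, but the underlying idea, comparing the label distribution of anomalous samples against a baseline window to find the most over-represented label value, can be sketched in a few lines. This is purely illustrative and not the product's actual algorithm:

```python
# Illustrative only: find the label value most over-represented among
# anomalous samples compared with a baseline window.
from collections import Counter

def most_suspicious_label(baseline, anomalous):
    """Each argument is a list of label dicts, e.g. {"build": "v2", "cluster": "us-east"}."""
    base_counts, anom_counts = Counter(), Counter()
    for labels in baseline:
        base_counts.update(labels.items())
    for labels in anomalous:
        anom_counts.update(labels.items())

    def lift(item):
        base_rate = base_counts[item] / max(len(baseline), 1)
        anom_rate = anom_counts[item] / max(len(anomalous), 1)
        return anom_rate - base_rate

    return max(anom_counts, key=lift)

baseline = [{"build": "v1", "cluster": "us-east"}] * 90 + [{"build": "v2", "cluster": "us-east"}] * 10
anomalous = [{"build": "v2", "cluster": "us-east"}] * 8 + [{"build": "v1", "cluster": "us-east"}] * 2
print(most_suspicious_label(baseline, anomalous))  # ('build', 'v2')
```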
AI-Native Applications: Managing Inference Health, Token Economics, and Model Accuracy
So we're going to start shifting gears. We've talked about training and fine-tuning models. That's in our view a smaller set of organizations. What most organizations are actually doing is consuming and building on top of inference. So first, let's define this term AI native.
I think it's definitely subjective at this point, but our view when we say AI native is people who are building from day one, designing around AI technologies. One fun way to think about this and test it is if you think of a product person or a founder and they say, "What if we built a," and then you put any product category in there—it could be an IDE, an HR tool, anything—and you say, "but with AI," most likely that's going to be an AI native product.
We can see up top here our traditional architecture. We have strict schemas and data models using REST architectures. We implement these capabilities one by one behind endpoints and then access them through different client devices. But with AI, right on the bottom here, we don't need to be as concerned or as strict with our data models, and we don't need to implement every single capability individually because the LLM has the ability to reason and take requests dynamically that aren't pre-implemented, and use data that might not be structured.
What we're seeing now is functionality built around inference and tokens, reasoning and RAG capabilities, and then really optimizing around your prompt and context engineering. Another thing you might notice is that startup URLs are now maybe innovativeguy.AI instead of disruptiveproduct.io.
We've talked about some of these other terms a lot already, but just to make sure we're level setting: when we say tokens, we mean essentially the word count going in and out of the LLM. This is used to gauge throughput and also to calculate pricing. Another key concept is evaluations: taking the inputs to an LLM, looking at the output you got, and evaluating whether it's what you expect, whether it's healthy, and whether it's doing what you needed it to do. And then lastly, RAG, or retrieval augmented generation, extends the knowledge or data available to a foundation model through external data sets.
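Since token counts drive both throughput and price, the arithmetic is worth seeing once. A toy example with made-up per-token rates (real prices vary by model and provider):

```python
# Toy token-economics calculation; the prices are made up for illustration.
PRICE_PER_1K_INPUT = 0.003   # assumed USD per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.015  # assumed USD per 1,000 output tokens

def request_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

per_request = request_cost(1_200, 400)   # 1,200-token prompt, 400-token answer = 0.0096 USD
per_month = per_request * 500_000        # at 500k requests/month, roughly 4,800 USD
print(f"{per_request:.4f} USD per request, about {per_month:,.0f} USD per month")
```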
Finally getting to the inference health use case. What matters typically is your model accuracy or performance, and you're comparing that directly against the token economics and the cost required to achieve those outcomes. Ultimately, what we're seeing AI natives really strive for is product differentiation through the use of AI. Otherwise, AI might not be the tool to solve that problem.
We're going to go through some examples of concrete inference health issues that observability can help solve. The first one is probably the most common we all hear about: hallucinations. A straightforward example: I ask an LLM what OTE is, and it tells me OTE is a 3D printed telescope. This is completely incorrect, but the LLM confidently states it as a fact. If you don't have a way to measure and evaluate this, you have no idea how often your users are experiencing it.
Another issue that can be even more dangerous is bias. If you're starting to use inference in things like hiring workflows or agentic HR, and there's bias you're not aware of, you might ask the model which candidate to hire and it says you should only hire hockey fans. Maybe that makes sense if you're hiring a hockey coach, but otherwise it could be really harmful. And then at the end here, less about the behavior of the inference and more about token economics, is excess token consumption. If I ask a simple question like what letter comes after A, and the model tells me the letter after A is B followed by C, you can count the number of excess words and characters, and at scale that adds up to wasted money.
Double clicking one level further, why do these things happen? If we start with hallucinations again, largely this is related to your own prompting and your own usage of the model. It could also be inaccuracies in the training data, or a lack of RAG tooling and up-to-date information: the model doesn't have the knowledge, so it tries to invent an answer that isn't available to it. For bias, if you have bias in your training data, you're going to have bias in your inference. You might not have evaluations or guardrails at all, or you might just have ambiguous prompts that aren't protecting against any bias that does exist.
Excess token consumption often happens with MCP servers: you've probably seen a model spin its wheels and make requests endlessly, just burning a ton of tokens. If that type of thing is happening inside an agentic workflow, the waste scales with it, and that's a lot of cost you could be spending on GPUs or elsewhere. You also might simply not have output filtering, or you might not be specifying a response format, so the LLM doesn't know what you want it to do and it guesses, producing waste.
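One cheap mitigation is to make the expected output size explicit and flag anything that blows past it, so the waste at least shows up in your telemetry. A small, hypothetical sketch of a per-request-type token budget check; the budgets and the whitespace tokenizer are stand-ins, and in practice you would use your model's real tokenizer and emit the result as a metric or span attribute:

```python
# Hypothetical output-budget guard: flag responses that exceed the expected
# token budget for their request type.
EXPECTED_OUTPUT_TOKENS = {
    "single_fact": 20,    # e.g. "what letter comes after A"
    "summary": 300,
    "agent_step": 800,
}

def count_tokens(text):
    """Naive stand-in tokenizer; use the model's real tokenizer in practice."""
    return len(text.split())

def check_output_budget(request_type, completion):
    budget = EXPECTED_OUTPUT_TOKENS.get(request_type, 500)
    used = count_tokens(completion)
    return {"request_type": request_type, "tokens": used,
            "budget": budget, "over_budget": used > budget}

print(check_output_budget("single_fact", "The letter after A is B, followed by C, then D, then E..."))
```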
Going back to our example where you might be hosting the inference yourself, you can look at your temperature settings and your inference and model configurations, and always keep an eye on the quality of your training data. GPU performance can also affect how the model behaves: it can restrict tool calls, change the behavior, and impact the accuracy of responses.
Let's return to Chronosphere one more time. How can we use observability to help protect against these pitfalls? What we're looking at here is an OTel trace instrumented with a library called Open Inference by Arize AI. This gives us everything we get in a standard OTel trace, but it also captures LLM-specific attributes that we can do a lot of additional analysis on. Anywhere in this trace, if there's a traditional service error, a hallucination, bias, and so on, that whole line will go red, and you'll know right away where in your agent's reasoning or request the issue is.
Then we can jump in, create trends, and use our anomaly detection again. On the right-hand side, we see the span details. When I talk about these LLM-specific attributes, what do I mean? I mean things like this: we can see which model, which model version, and the actual prompts, inputs, and outputs. We can feed those into evaluation systems like Phoenix, create our own, or even run evaluations at the code level. We also get all of our token counts and these assessment attributes for hallucination, bias, or toxicity.
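You can get these attributes from the Open Inference auto-instrumentation, or set them yourself on a span around the model call. A hedged sketch of the manual version using the plain OpenTelemetry API; the attribute keys are modeled on OpenInference-style conventions (`llm.model_name`, `llm.token_count.prompt`, `llm.token_count.completion`, `input.value`, `output.value`), and the evaluation keys and the `call_llm` stub are assumptions you would align with your own evaluator and client:

```python
# Sketch: attach LLM-specific attributes to a span around an inference call.
# Verify the exact attribute keys against the OpenInference version you use.
from opentelemetry import trace

tracer = trace.get_tracer("ai-native-app")  # assumes a TracerProvider is configured elsewhere

def call_llm(prompt):
    """Stand-in for the real model call."""
    return {"text": "A cat is a small domesticated felid.", "prompt_tokens": 42, "completion_tokens": 11}

def answer(prompt):
    with tracer.start_as_current_span("llm.completion") as span:
        result = call_llm(prompt)
        span.set_attribute("llm.model_name", "example-model-v1")  # assumed model name
        span.set_attribute("llm.token_count.prompt", result["prompt_tokens"])
        span.set_attribute("llm.token_count.completion", result["completion_tokens"])
        span.set_attribute("input.value", prompt)
        span.set_attribute("output.value", result["text"])
        # Evaluation verdicts (from Phoenix, an LLM judge, or custom code), assumed keys:
        span.set_attribute("evaluation.hallucination", False)
        span.set_attribute("evaluation.bias", False)
        return result["text"]
```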
What this lets me do is drive all these useful trends and start analyzing the data. So if I'm an AI-native product, what do I care about? You might care about choosing the right model for the job, so we can see the average cost per request broken down by model. We can compare that over time; maybe a new release switches this up, and the way you're doing your prompting makes one model cheaper than an alternative, for example. These are the kinds of things you always want to keep an eye on so you can make data-driven decisions and continually improve your product and your implementation.
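As a concrete version of that cost-per-request-by-model view, here is a small aggregation sketch over per-request records; the prices and records are placeholders, and in practice the trend would be driven directly from the token-count attributes on your traces:

```python
# Sketch: average cost per request, broken down by model, from per-request records.
from collections import defaultdict

PRICES = {  # assumed USD per 1,000 tokens: (input, output)
    "model-a": (0.003, 0.015),
    "model-b": (0.0005, 0.0015),
}

requests = [  # would normally come from trace/span data
    {"model": "model-a", "prompt_tokens": 1200, "completion_tokens": 400},
    {"model": "model-a", "prompt_tokens": 900,  "completion_tokens": 350},
    {"model": "model-b", "prompt_tokens": 1100, "completion_tokens": 500},
]

totals, counts = defaultdict(float), defaultdict(int)
for r in requests:
    in_price, out_price = PRICES[r["model"]]
    totals[r["model"]] += (r["prompt_tokens"] / 1000) * in_price \
        + (r["completion_tokens"] / 1000) * out_price
    counts[r["model"]] += 1

for model, total in totals.items():
    print(f"{model}: {total / counts[model]:.4f} USD per request on average")
```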
Then at the bottom here, we've talked a lot about hallucinations already. But again, maybe you make a change, maybe the model provider makes a change, and all of a sudden you see a spike in hallucinations. You need to action that and pull it out of production right away. This is something that I think traditionally AI and ML teams were focused on, but now if you're an SRE or a support operator and you see this happening in production, the speed of actioning that becomes critical to stay out of the news stories.
In closing, we talked about a lot of different data. One thing to highlight: Chronosphere does not have a proprietary agent. All the data we used in this talk came from open source sources: OpenTelemetry SDKs and collectors, the NVIDIA DCGM Prometheus exporter, kube-state-metrics, the Prometheus node exporter, the Open Inference SDK from Arize AI, Phoenix, and Fluent Bit for our logs. That's everything I wanted to cover. Hopefully it was valuable. I really appreciate you listening to the talk and hope you have a great rest of your re:Invent. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.