🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - How to prep Telemetry data for AI consumption (DVT222)
In this video, Grepr's real-time machine learning solution is presented, demonstrating how to achieve 100% observability at 10% of the cost. The speaker explains how Grepr automatically identifies patterns in logs and traces, passing through high-signal data while reducing noise by over 90%. For traces, Grepr analyzes full trace structures rather than just endpoints, enabling better performance tracking. All raw data is stored in an observability data lake for long-term access and can be backfilled when needed. The solution takes 30 minutes to deploy and works automatically without impacting developer workflows.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Grepr's Real-Time Machine Learning: Achieving Full Observability at 10% of the Cost
Hey everybody. Thanks for sticking around. I think I might be one of the last few sessions happening, so thank you for stopping by. Today I'm going to talk about how Grepr's real-time machine learning can help you get 100% of the observability you're seeing today at 10% of the cost. I'll start by talking a little about the AI-for-telemetry problem, because it's something people always face, then about extracting signal from that data so you can actually feed it into AI, and finally about how Grepr works to get there.
If you think about how people have been working with observability for the past 15 to 20 years, it all started with full-stack observability: you would collect data from agents and send it into an application, and that aggregator was a full-stack, walled-garden platform that defined what you could actually do with that data and nothing more. Over time we started seeing more openness and modularization in these observability platforms. OpenTelemetry came in as a protocol to separate data collection from the data aggregators, and then we started seeing telemetry pipelines as well. What happens after that is we want to enable AI-powered ops and workflows, which is what everybody is talking about today. We want to empower SREs and DevOps to handle enormously more complex operations and systems.
But the biggest problem with AI systems today is that the data is enormous and mostly noise. Of everything you're collecting, you might only ever need maybe 1% of it. If you were to feed all of that data into an AI model and say, "hey, figure out what's actually in there," it's really garbage in, garbage out. So the biggest problem with using AI for observability is the ability to denoise the data and figure out how to concentrate it, so you have clean data for these systems. And really, that's what Grepr does.
The way Grepr works is that you start with your existing deployments. You're collecting logs, traces, and metrics, and that data is going to your observability vendor, maybe Splunk, Datadog, Grafana, whatever it is. Grepr sits in the middle. We automatically figure out what is noisy and what isn't, what all the patterns in your data are, and how much volume is actually passing through each of those patterns. We use that to give you full coverage of your application by passing through data for all of those patterns and making sure we don't miss anything that might be useful. This is all automatic and works out of the box: we look at the data, figure out the patterns, and we can operate on millions of patterns in the data. Today we do this for logs and traces, and we're building metrics support next year.
The way Grepr works for logs is that as the data passes through in real time, we figure out the patterns in those logs. In this example, there are two patterns in the data: these GETs and these POSTs. We pass through the initial few samples for each pattern, and once we have enough samples we say, "we've seen enough of these log messages, let's start reducing them." At the end of a two-minute window we send you a summary saying we've seen this much of one pattern and that much of another, and we can also aggregate data inside those patterns, so we can say things like "we've seen an average latency of this much" or "this many bytes passed through."
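To make the idea concrete, here is a minimal sketch of that kind of pattern-based reduction in plain Python. It is not Grepr's actual algorithm or API: the masking rule, the five-sample limit, and the summary format are illustrative assumptions, and only the two-minute window comes from the talk.

```python
import re
from collections import defaultdict

# Illustrative sketch only -- not Grepr's actual algorithm or API.
SAMPLE_LIMIT = 5          # pass through the first N messages seen for each pattern (assumed)
WINDOW_SECONDS = 120      # the two-minute summary window mentioned in the talk

def pattern_of(message: str) -> str:
    """Reduce a log line to a rough template by masking the variable tokens."""
    return re.sub(r"\d+", "<num>", message)

def process_window(log_lines):
    """Pass through early samples per pattern, then summarize everything else."""
    passed_through, counts = [], defaultdict(int)
    for line in log_lines:
        key = pattern_of(line)
        counts[key] += 1
        if counts[key] <= SAMPLE_LIMIT:
            passed_through.append(line)      # high-signal: forward downstream as-is
    summaries = [
        f"[summary] pattern={key!r} count={count} in {WINDOW_SECONDS}s window"
        for key, count in counts.items() if count > SAMPLE_LIMIT
    ]
    return passed_through, summaries

logs = [f"GET /api/users/{i} 200 {12 + i}ms" for i in range(100)] + \
       [f"POST /api/orders 500 error id={i}" for i in range(3)]
forwarded, summaries = process_window(logs)
print(len(forwarded), "lines forwarded;", summaries)
```

In this toy run, 100 GET lines collapse into five samples plus one summary, while the three POST errors all pass through untouched, which mirrors the "sample first, summarize the rest" behavior described above.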
This lets you get exactly the data you need, with high-signal data passed downstream, whether to AI models or to your observability vendor. All of this is configurable: you can make the window one minute instead of two, or decide not to aggregate a particular pattern and pass it through instead. We also do things like automatically figure out which logs power your dashboards and alerts and automatically add them into Grepr, so rolling it out doesn't change or impact your workflows.
For traces, we do something similar. If you're familiar with tail sampling, the way it works is that you look at the endpoints of the traces, where requests land and where traces start, and then you look at the performance of each of those endpoints.
But there's a problem here: what if that endpoint is sometimes cached and sometimes not? That means your traces actually take different paths depending on whether the data is cached. In this example, all of the traces start at a circle, and we have two red ones: one has only two hops and the other has three or four. You want to look at the performance of these two paths separately, even though they start at the same endpoint.
What we do is look at the full structure of every trace to map out your entire application and understand all the things we need to pass through and make sure the end user is aware of. We then track the performance of each of those full-structure signatures, which lets us understand when this particular path is slow and when that particular path is fast. We can then drop the noisy, unnecessary data and give you full sampling coverage across your entire application. But ultimately, no data is ever lost.
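As a rough illustration of how this differs from endpoint-only tail sampling, the sketch below keys each trace by a signature built from its full span structure, so a cached two-hop path and an uncached multi-hop path to the same endpoint are tracked separately. The span fields and helper names are assumptions made for the example, not Grepr's data model.

```python
from collections import defaultdict
from statistics import mean

# Conceptual sketch: key each trace by its full span structure, not just its root endpoint.

def trace_signature(spans):
    """Build a stable signature from the parent->child structure of a trace."""
    root = next(s["name"] for s in spans if not s["parent"])
    edges = sorted((s["parent"], s["name"]) for s in spans if s["parent"])
    return (root, tuple(edges))

def latency_by_path(traces):
    """Track latency per full-path signature; cached and uncached paths separate out."""
    by_signature = defaultdict(list)
    for spans in traces:
        root_span = next(s for s in spans if not s["parent"])
        by_signature[trace_signature(spans)].append(root_span["duration_ms"])
    return {sig: mean(durations) for sig, durations in by_signature.items()}

cached = [{"name": "GET /item", "parent": None, "duration_ms": 4},
          {"name": "cache.get", "parent": "GET /item", "duration_ms": 1}]
uncached = [{"name": "GET /item", "parent": None, "duration_ms": 90},
            {"name": "cache.get", "parent": "GET /item", "duration_ms": 1},
            {"name": "db.query", "parent": "GET /item", "duration_ms": 60},
            {"name": "render", "parent": "GET /item", "duration_ms": 20}]

# Endpoint-only sampling would mix these; per-structure grouping keeps them apart.
print(latency_by_path([cached, uncached, cached]))
```

Grouping by the same root endpoint would report one blended latency; grouping by structure surfaces a fast cached path and a slow uncached path, which is the distinction the talk is drawing.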
We put all the raw data into an observability data lake, which lets us keep that data at low cost for a very long time. You can keep it as long as you want, and it can always be queried. If you ever want to find a log message from six months ago, you can go look at it without doing a rehydration; but if you want to, you can also backfill that data into your observability vendor, either manually or automatically via a trigger.
You can hook up this backfill to a ticketing system, for example. If a customer opens a particular ticket, maybe you want to load all the logs or traces relevant to that customer. Or maybe you have an anomaly or fraud detection system you want to hook up, so that when an analyst finds an anomaly, all the logs are already there in your end system for them to troubleshoot.
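A hedged sketch of what such a hook might look like: a handler that reacts to a new ticket by pulling that customer's raw logs from the data lake and replaying them into the observability vendor. Every function and field name here is a placeholder, since the talk doesn't describe Grepr's actual backfill API or the ticketing payload.

```python
# Conceptual sketch of a ticket-triggered backfill; all names below are placeholders,
# not Grepr's real API.

def query_data_lake(customer_id: str, lookback_hours: int = 24) -> list[str]:
    """Placeholder: query the raw-log data lake for one customer's recent logs."""
    return [f"2025-12-01T10:00:00Z customer={customer_id} checkout failed: timeout"]

def backfill_to_vendor(log_lines: list[str]) -> None:
    """Placeholder: replay the selected raw logs into the observability vendor."""
    for line in log_lines:
        print("backfilling:", line)

def on_ticket_created(ticket: dict) -> None:
    """When a support ticket opens, pre-load that customer's logs for troubleshooting."""
    backfill_to_vendor(query_data_lake(ticket["customer_id"], lookback_hours=48))

on_ticket_created({"id": "TICKET-123", "customer_id": "acme-corp"})
```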
The results we've seen from our customers are very encouraging: over 90% reduction in many cases, with minimal impact on developer workflows. Grepr usually takes about 30 minutes to get started with. You set it up, point your existing agents at Grepr, and it automatically starts figuring out the patterns, doing the compression, and sending the data through.
This changes that conversation with your developers or your SREs who are trying to figure out how to even get started. If you've got 100 teams and they're all using logs or traces, and you need to really cut down this spend but you're not really sure where to start, you might wonder whether to start looking at patterns one by one and adding drop filters and sampling rates for these patterns. What we do is we just set it up automatically for you.
That changes the conversation from being a blank slate where you have to do something, learn this thing, and get certified in it, to a place where it's actually more about tuning. You set it up, it starts working, and you look at the data that's passing through. You can make decisions about whether this is good, whether this is enough, or whether you need more. You can do that as time moves on because ultimately there's no risk. The data's all in the data lake, so if you ever need something that isn't actually forwarded, you can always go back into the data lake to fetch it.
You'll always find everything there, which makes it really easy to roll out Grepr with the assurance that your data will be there, without impacting workflows or increasing MTTR. Thank you. This is a very quick talk because Grepr is actually very fast and easy to describe, but I'm happy to take any questions since we've got about 10 minutes left.
This article is entirely auto-generated using Amazon Bedrock.