Kazuya


AWS re:Invent 2025 - Implement Agentic AI at the edge for industrial automation (HMC317)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Implement Agentic AI at the edge for industrial automation (HMC317)

In this video, AWS Solutions Architects Karim Akhnoukh and Mohammad Salah demonstrate deploying agentic AI on AWS Outposts to tackle unplanned manufacturing downtime that costs $1.4 trillion annually. They address challenges including data silos, skills gaps, and connectivity issues by implementing a unified data lake with small language models at the edge. The session covers fine-tuning Llama 3.2 3B on an instruction dataset, achieving a 14% performance improvement over the base model through LLM-as-a-Judge evaluation with Claude and Nova Pro. They demonstrate deploying GPT OSS as a routing model, implementing RAG with Chroma DB, and creating tools for telemetry data retrieval using the Strands SDK. The live coding shows quantization strategies (FP16 to MXFP4) for deploying models on G4dn instances with NVIDIA T4 GPUs, and building a factory agent that debugs production issues by integrating real-time sensor data with machine manuals.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The $1.4 Trillion Problem: Unplanned Downtime in Manufacturing

My name is Karim Akhnoukh. I'm a Senior Solutions Architect at AWS. I've been with the company for almost three years, and I've worked with many of the big manufacturers in Germany, helping them navigate the complexity of deploying generative AI applications. My name is Mohammad Salah, and I'm working as a Solution Architect at AWS looking after the public sector in the Middle East. My job at AWS is solving similar problems to Karim's, but for real manufacturers.

Picture this: inside your factory, machines have stopped. Your production line is paused. Everyone starts running around. Every minute your products are not made, your orders get delayed, and costs pile up. This is what we call unplanned downtime. The top 500 manufacturers globally are paying for this: they're losing $1.4 trillion due to unplanned downtime, around 11% of their revenues every year. To put this in perspective, that is equivalent to the GDP of a nation like Spain.

Thumbnail 40

When we take a closer look at this, we found four different challenges. The first is data silos. You have a production line with multiple machines. Each machine has its own alarms and telemetry data, but each machine has its own separate database. You end up without an end-to-end view to understand the relationships between machines in the same production line. You have the data, but the data is disconnected.

The second problem is the skills gap. It's all about people. Every manufacturer has senior experts who can tell you from the sound of the mixer that there is a problem, that the dough is overmixed. They can tell from the oven that the right side is running much hotter than the left side. This kind of expertise is critical. If these senior experts are not there and a junior receives this kind of warning, they will not be able to act. You have the knowledge, but the knowledge is not shared.

The third challenge is production delays. This happens when you have the junior staff without the seniors, you have the data but it's scattered, and they're not able to act because they don't have the expertise, they don't have the knowledge, and the data is disconnected.

Thumbnail 110

The last one is operational disruption. This is very important because most factories are in remote areas, where they often don't have stable internet connectivity or cloud access. If you have a smart factory and it becomes disconnected for any reason, your factory will be blind. You will not have data-driven decisions and insights.

Building a Unified Data Architecture on AWS Outposts with Agentic AI

That's why we thought about a simple solution: let's put all the data together in one single data lake inside AWS Outposts. But when we did this, we found a different challenge. Each machine has its own integration type. You might have bought machines from OEM X, OEM Y, and OEM Z. However, each OEM has its own integration and technology. For modern IoT integration, there's MQTT or LoRaWAN. Or we're talking about REST APIs with HTTP POST and GET. And the most important part for your manufacturer is the documents: those standard operating procedure documents and manuals that you can get using legacy SFTP.

Thumbnail 250

So we thought: why not have the Outposts underlying infrastructure run an EKS local cluster that exposes a unified API layer to absorb those different integrations and feed them into one consolidated data lake?

Thumbnail 330

The consumers of this data lake will then be AI, in our case Bedrock, plus dashboards for deeper insight using Amazon QuickSight. This is the result. On the left side you have your production line, where you understand everything. You have the full details and a complete understanding of your production line, and you can see the inspector is telling you that you have cracked cookies. The machine stays down for 5 minutes.

Thumbnail 360

Thumbnail 380

Thumbnail 400

Thumbnail 410

The second problem is resiliency, what we are calling "always on": how to stay always on. If your Outposts gets disconnected from the cloud for any reason, you will not be able to act as a smart manufacturer anymore. You will not have this kind of data-driven decision making, and you will not get automated AI recommendations. That's why we thought: why not add this intelligence at the edge by deploying multiple small language models? We'll go through this later in the session. This gives us two different capabilities. First, providing intelligence for the operators, who can use text and even voice to chat with the data. Second, applying automation: you digest your telemetry data every 5 minutes, understand the insights, get the recommendations, and if these recommendations require a notification, you notify the operator with the required actions.

Thumbnail 430

Thumbnail 460

Before we jump into the coding session, which I know you're looking forward to, please allow me 2 more minutes to explain the big picture so you understand what the end-to-end workflow looks like. We imagine a cookie factory which produces the best-in-class cookies here in Vegas. For simplicity, this cookie factory has 3 big machines: the cookie former, followed by the freezer tunnel, and lastly the cookie inspector. Now imagine the situation when the cookie inspector, which does the visual quality inspection, detects a cookie which is cracked, misshapen, or has some air pockets.

Thumbnail 470

Thumbnail 480

Thumbnail 490

Thumbnail 510

Then we have our Agentic AI application running inside AWS Outposts for always-on connectivity. This Agentic AI application has access to different data sources, including the machine manuals, the standard operating procedures, and real-time telemetry data and alarms. Operators inside the production site can chat or voice chat with the agent. They can understand what the underlying problem is and then instruct the agent to take some actions itself to correct the problem inside the production site. Finally, we move from the state where we had a cracked cookie to a state where we are producing delicious cookies and everyone is happy again.

Thumbnail 530

But you know what the problem is with what I'm showing here? I wish that the solution was as simple as plugging in the icon of AWS Outposts. Under the hood, there is a bit more complex architecture. On the right-hand side we have the pipeline for data preparation and model fine-tuning, which we'll discuss in detail for the rest of the session. On the left-hand side we have our Agentic AI application deployed on AWS Outposts, and then we have different specialized small language models. We have the GPT OSS model, which will be the routing model that navigates the different tasks from the users to the different tools. We have the fine-tuned Llama 3.2 3B model, which will be specialized for the RAG task, and we finally have the small VLM for visual quality inspection. For the sake of time, we will not be implementing the small VLM, but we will go through every other detail of the implementation.

Thumbnail 580

Thumbnail 590

Infrastructure Setup: Deploying Small Language Models on EC2 Instances

So let's go. That's what you want. I hope you're seeing my screen. You do, right? So I will start with setting up the infrastructure deployment.

Thumbnail 610

Thumbnail 620

Thumbnail 640

Thumbnail 660

Thumbnail 670

Thumbnail 680

Thumbnail 710

I will be mimicking the AWS Outposts environment with a couple of EC2 machines. Let me navigate to the first method here, which is the create-EC2-instance method. As you can see, I'm using the g4dn.xlarge instance, which is a relatively small but capable instance that can run a small language model. I'm selecting a specific AMI, which is the NVIDIA one, and then I'm creating some security groups, VPCs, an IAM role, and so on. However, the most important part that I would like to stress here and draw your attention to is the user data script, which will be uploaded to the EC2 instances. This script will install the actual dependencies needed to run the agentic AI application. If we take a look at this EC2 machine, the user data script is 148 lines long, so it's not very realistic to go through it line by line. Instead, I would like us to brainstorm together about what dependencies should be added to this EC2 machine. Please take a moment, scan the QR code, and think about what you would need on the EC2 machine to host an AI agent and a small language model, which is the router small language model. Please think about it, and in a moment I will show some results.

Thumbnail 720

We're still collecting the votes. So you guys almost nailed it. Of course, we need the agent code. That goes without saying. We need the Docker runtime because we will use Docker to pull the Ollama container, which will be our software for hosting the small language model. We don't need the fine-tuned model for this specific EC2 machine because we have another EC2 machine where we will deploy the fine-tuned model. We need the NVIDIA toolkit for the CUDA software. We need a custom Python version because the SDK that we will be using for building the AI agent is Strands, and Strands requires a minimum Python version of 3.10. The default Python version on the G4dn instance is Python 3.9, which is why we need to install a custom Python version. And the one who said Ollama, that's actually correct, because we will be running Ollama as a Docker container.
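For readers who want to reproduce this outside the console, here is a minimal boto3 sketch of that create-EC2-instance step. It is a sketch under assumptions, not the speakers' exact script: the AMI, security group, region, and the gpt-oss:20b Ollama tag are placeholders, and it assumes the NVIDIA GPU-optimized AMI already ships Docker and the NVIDIA container toolkit.

```python
import boto3

# Hypothetical placeholders: substitute your own NVIDIA GPU-optimized AMI and security group.
AMI_ID = "ami-0123456789abcdef0"
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

# Minimal user data: start Ollama as a GPU-enabled container and pull the routing model.
# The real script in the session is ~148 lines and also installs Python 3.10+ for Strands,
# copies the agent code, and installs its dependencies.
USER_DATA = """#!/bin/bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull gpt-oss:20b
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId=AMI_ID,
    InstanceType="g4dn.xlarge",
    MinCount=1,
    MaxCount=1,
    SecurityGroupIds=[SECURITY_GROUP_ID],
    UserData=USER_DATA,
)
print("Launched:", response["Instances"][0]["InstanceId"])
```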

Thumbnail 800

Thumbnail 820

Thumbnail 830

Thumbnail 850

Thumbnail 860

Thumbnail 880

That was for the first EC2 machine for running the agent. Now please again scan the QR code and think about what dependencies we need for the EC2 machine that will hold the fine-tuned small language model. Let me view the results. Yes, we need Docker. We need the fine-tuned model. In that case, we don't need a custom Python version anymore because it works with the default version of the G4dn.xlarge. That was a tricky question. We don't need the agent code that was deployed on the first EC2 machine. And we need Ollama, and we need the small language model of course. I'm happy that we brainstormed the dependencies. After we write the user data script, like I said, we create a VPC, we create security groups, we create an IAM role with access to S3, and when we are happy with that, we need to deploy our application. Perfect. Now we have infrastructure starting to deploy. During this time, let me give you some tips and tricks on how to do a deployment of a small language model over AWS Outposts.
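As a rough sketch of what the second machine's setup could look like: the bucket, key, and model names below are assumptions, and it assumes the SageMaker fine-tuning output has already been converted to GGUF format (for example with llama.cpp's conversion script), a step the session does not show.

```python
import subprocess
import boto3

# Hypothetical S3 location of the fine-tuning output produced later in the session.
BUCKET = "my-finetune-output-bucket"
KEY = "llama-3-2-3b-factory/output/model.tar.gz"

s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, "model.tar.gz")

# Assumption: the archive was unpacked and converted to a single GGUF file, factory-llama.gguf.
# A one-line Modelfile points the local Ollama daemon at those weights.
with open("Modelfile", "w") as f:
    f.write("FROM ./factory-llama.gguf\n")

# Register the fine-tuned weights with Ollama under a custom model name.
subprocess.run(["ollama", "create", "factory-llama", "-f", "Modelfile"], check=True)
```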

Thumbnail 940

We'll take a closer look here. The GPU instance that we have on AWS Outposts is the G4dn. To deploy any model, you have to consider two things. First is the hardware constraint. We are talking about the g4dn.12xlarge, which has 4 NVIDIA T4 GPUs, each powered by 16 gigabytes of memory. The second consideration is the model itself. We are taking an open-source model as an example, GPT OSS 20B, around 20 billion parameters. But here I mention several model characteristics, which are very important to consider while doing a deployment over AWS Outposts: whether this is a mixture of experts or not, how many layers it has, and the active parameters per token. Now, let's focus on the first strategy for deployment, which is quantization. Quantization, in a nutshell, is reducing the number of bits representing your weights in order to reduce the memory footprint. This model was trained using FP16 precision. This is the base model. However, we can have another quantization, which is MXFP4. We are talking about how to represent each weight, each parameter. Instead of representing each weight with 16 bits, you can represent the same weight with 4 bits. This makes a significant reduction in the memory footprint.

Thumbnail 1060

Instead of the full precision, which is the base model at 40 gigabytes, you can get down to only 13 gigabytes, but everything comes with a cost. The cost here is accuracy. For the baseline, you don't have an accuracy impact. However, for MXFP4, because you are reducing the number of bits representing your weights, you will have a 1 to 2 percent accuracy impact. This means that you can save 65 percent of your memory by trading it for 1 to 2 percent of accuracy loss. By doing this, you will have the full model deployed in 13 gigabytes of memory. Do you remember the hardware constraint? We have 16 gigabytes. Now you can have the full model deployed inside one GPU.
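As a quick back-of-the-envelope check of those numbers, here is a weight-only memory estimate that ignores activations and runtime overhead:

```python
PARAMS = 20e9  # GPT OSS 20B, as quoted in the session

def weight_memory_gb(bits_per_weight: float) -> float:
    """Weight-only memory estimate: parameter count x bits per weight, in gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(16)  # ~40 GB: does not fit on a single 16 GB T4
mxfp4_gb = weight_memory_gb(4)  # ~10 GB raw; ~13 GB in practice once block scales
                                # and runtime overhead are included
print(f"FP16 ~{fp16_gb:.0f} GB, MXFP4 ~{mxfp4_gb:.0f} GB raw "
      f"(~13 GB in practice, roughly 65% less memory than FP16)")
```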

Thumbnail 1120

One of the deployment strategies is to deploy the full model in one GPU. As you can see here, we replicate the same model across the GPUs. The significance of doing this is for latency-sensitive workloads: you do parallel processing for multiple clients at the same time, so your application does not have to tolerate queuing latency. You process each client on a separate GPU. This improves throughput by serving four different clients without queuing.

Thumbnail 1130

The second strategy is tensor parallelism. You have one single model with its weights sharded across the GPUs, where each GPU holds about 3.2 gigabytes of model weights and the remaining memory is used for the KV cache. The KV cache simply caches the tokens of your session so the model does not recalculate them every time it generates new content. If you have a larger KV cache, you can serve a larger number of users and, at the same time, you get a larger context window, which is important for your session.
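The split the speakers describe works out as follows; this simple estimate ignores activation and framework overhead:

```python
GPUS = 4                  # g4dn.12xlarge: 4x NVIDIA T4
GPU_MEMORY_GB = 16.0      # per T4
MODEL_MEMORY_GB = 13.0    # MXFP4-quantized GPT OSS 20B, as quoted above

weights_per_gpu = MODEL_MEMORY_GB / GPUS            # ~3.25 GB of weights on each GPU
kv_cache_headroom = GPU_MEMORY_GB - weights_per_gpu # ~12.75 GB left per GPU for the KV cache

print(f"Weights per GPU: {weights_per_gpu:.2f} GB, "
      f"KV-cache headroom per GPU: {kv_cache_headroom:.2f} GB")
```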

Thumbnail 1200

Data Preparation and Model Fine-Tuning with Amazon SageMaker

Let me show you the foundational part of this. We started with the high-level architecture, and now we can start with data preparation. Let me show you something with the data here. As you can see, those are the data sources. You have different types of data sources: CSV files, text files, and PDF files.

Thumbnail 1210

This is what typical data sources look like. You have different files with different formats because in your production line you have different OEMs, and each one of them has its own structure. What we need to do is fine-tune the model, but what is the objective of fine-tuning the model? The objective here is to give your operators an assistant. This assistant will help your operators get detailed, step-by-step instructions while considering the severity of each problem, compliance, safety, and so on.

Thumbnail 1280

But the most important part is to let the model understand the skills needed and talk with the tone needed. That's why we decided to fine-tune the model using instruction fine-tuning. Let's start with what's exactly happening under the hood. I will show you here on the right side. This is a data pipeline where you read different documents (PDF files, text files, and document files) and then invoke a larger model in order to generate structured data to fine-tune your model.

The most important question here is: why don't we throw the data to the model directly to fine-tune the model? The answer is simple. If you are throwing your data right away to your model, you are training a hallucination machine. The important part here is reading the document, invoking the model to return structured data accordingly, validating that data, and then you are happy to have fully structured documents in order to fine-tune the model.

Thumbnail 1340

Thumbnail 1370

Thumbnail 1380

Thumbnail 1390

Let's start by running this. As you can see, the pipeline starts reading the documents and at the same time starts invoking Bedrock to generate questions and answers. Why generate questions and answers? In order to have an instruction dataset to fine-tune your model. Let me show you something here. This is the data preparation class that we are using to invoke the model. As you can see here, we are using this model to generate structured data. And the most important part of this, let me show you something very important: it's the Converse API. As you can see here, we are using the Converse API.

Thumbnail 1400

The Converse API gives you the flexibility of changing any model at any point in time because it uses the same API signature for different models. At the same time, the most important part is the inference configuration. If you configure the maximum number of tokens correctly, you will get your answer in full, without it being truncated or split.
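A minimal sketch of such a Converse call is shown below. The system prompt, the JSON schema, and the Nova Pro model ID are illustrative assumptions, not the speakers' exact code; what it demonstrates is the model-agnostic signature and the inferenceConfig with an explicit maxTokens.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Condensed stand-in for the session's one-shot system prompt (the real one also covers
# severity levels, safety, and compliance in much more detail).
SYSTEM_PROMPT = (
    "You generate question/answer pairs from manufacturing SOP excerpts. "
    'Return strict JSON: [{"question": "...", "answer": "...", "severity": "..."}]'
)

def generate_qa_pairs(sop_chunk: str, model_id: str = "us.amazon.nova-pro-v1:0") -> list:
    """Ask a larger Bedrock model to turn an SOP chunk into structured Q&A pairs."""
    response = bedrock.converse(
        modelId=model_id,  # swap models freely: Converse keeps the same signature for all of them
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": sop_chunk}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},  # enough tokens to avoid truncation
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```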

Thumbnail 1430

The most important part here is the system prompt. This is an instruction prompt with one, and only one, shot. I'm giving the model clear and detailed instructions to generate realistic information, to include the severity, to answer the safety questions correctly, and to include the needed information. By doing this, you will be able to get a full dataset of structured data.

Second, and this is very important: why should we trust the LLM? Maybe the LLM itself will hallucinate. In that case, you have to apply what we call deterministic validation. Yes, the LLM generated the content, but you still check whether this content is correct, whether it is a valid question and answer, whether it follows what the standard operating procedure tells you, and whether it follows the required structure for the request and the response. Why is this important?

Thumbnail 1540

Because we need to make sure that we do not have a single duplicated record in the structured data used for fine-tuning the model. The second consideration is sharding. If you start to shard your data, you have to do it in a smart way. If you have a very long paragraph, you cannot cut the paragraph in the middle. If you have a large section, you have to make sure that this section is split in a correct way. You cannot split a sentence in the middle.
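A minimal sketch of what such deterministic validation could look like, assuming the Q&A pairs arrive as a JSON array with question/answer/severity fields; the cross-check against the SOP content that the speakers mention is omitted here:

```python
import hashlib
import json

REQUIRED_KEYS = {"question", "answer", "severity"}  # assumed schema for this sketch

def validate_pairs(raw_llm_output: str) -> list:
    """Deterministic checks on LLM-generated Q&A pairs: valid JSON, required keys, no duplicates."""
    accepted, seen = [], set()
    try:
        pairs = json.loads(raw_llm_output)  # reject anything that is not valid JSON
    except json.JSONDecodeError:
        return accepted
    for pair in pairs:
        if not isinstance(pair, dict) or not REQUIRED_KEYS.issubset(pair):
            continue  # reject records missing required fields
        fingerprint = hashlib.sha256(pair["question"].strip().lower().encode()).hexdigest()
        if fingerprint in seen:
            continue  # reject duplicate questions
        seen.add(fingerprint)
        accepted.append(pair)
    return accepted
```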

Thumbnail 1560

Thumbnail 1580

Thumbnail 1590

Why is this important again? The same objective applies: structured data, implemented correctly. As we can see on the right side, the job is done. We have generated the Q&A and done the validation, and as you can see we have some statistics here. We requested 50 pairs of questions and answers, 45 were generated, and 5 were rejected. We can check the output file here. From the output file you can see this one was rejected because of the frozen belt misalignment section. This is the content and this is the critical information which is wrong. Why was it caught? Because the pipeline validates against both the required structure and the data inside the SOP itself.

Thumbnail 1620

Thumbnail 1630

Thumbnail 1640

Thumbnail 1650

Accordingly, we uploaded the data to S3 in order to start fine-tuning the model. This is Amazon SageMaker AI Studio, where you can start to fine-tune the model. If you look on the left, you can see JumpStart. Inside JumpStart, you can select the model. Here, the model family is Meta, and Llama 3.2 3B Instruct is our model. Here you will find three different options: either to evaluate, deploy, or train. In our case, we are going to train the model, so I select training.

Thumbnail 1670

Thumbnail 1680

Thumbnail 1690

Thumbnail 1700

Thumbnail 1710

Thumbnail 1730

After selecting the model, you have to select the dataset. We did this as part of the data preparation. Here is the dataset we just uploaded. We select the data, and then this fine-tuning will produce an output: the updated weights of your model based on your dataset. You set the S3 bucket where you will get your updated model. Accordingly, you can go through the rest. I think it is normal to have 5 epochs. And this is very important: this is the AWS-recommended recipe for the fine-tuning, using this G5 instance. If you want to select a different instance, that is fine; this is fully flexible, but this is what we recommend for fine-tuning the model.
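The same training job can also be launched from code with the SageMaker Python SDK. This is a hedged sketch: the exact JumpStart model ID, hyperparameter names, instance type, and S3 path are assumptions and should be taken from the JumpStart model card, not from here.

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Model ID and hyperparameter names are assumptions based on the JumpStart Llama recipes;
# check the model card in your region for the exact values.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b-instruct",
    environment={"accept_eula": "true"},   # Meta models require accepting the EULA
    instance_type="ml.g5.12xlarge",        # a G5 instance, as in the recommended recipe
    hyperparameters={"epoch": "5", "instruction_tuned": "True"},
)

# The "training" channel points at the instruction dataset uploaded to S3 in the previous step.
estimator.fit({"training": "s3://my-finetune-bucket/factory-qa-dataset/"})
```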

Thumbnail 1740

Thumbnail 1750

Validating Fine-Tuning Success: LLM as a Judge Evaluation

Accordingly, we are going to use the same Identity and Access Management role, the same VPC, and the same encryption keys, and then we submit the job. We have to agree to those terms, and then the job starts fine-tuning the model. Now, I bet that many of you are asking yourselves one question which is, in my opinion, very valid. The question is: why are you guys doing this model fine-tuning and data preparation when you will be using a RAG task at the edge anyway? Maybe you are asking yourself if we just added this piece of architecture in order to make our architecture look a little bit more fancy.

Well, while we may love doing so, in fact we have done extensive experimentation to prove that this approach is indeed valid. To prove it, we used a technique called LLM as a Judge evaluation. Let me show you how this technique works.

Thumbnail 1810

Thumbnail 1820

Thumbnail 1840

Thumbnail 1850

Please allow me a moment; for some reason, the aspect ratio is not working well. Let me explain how LLM as a Judge evaluation works. I had some fancy animation, but anyway: we have our fine-tuned model and our base model, the Llama base model produced by Meta, and we want to compare the performance of both models on the RAG task.

Thumbnail 1900

Thumbnail 1910

Thumbnail 1930

At the very first step, what we provide to both models is the question and the context from the test dataset. We ask both models to generate an answer based on those questions and context, because this is how you typically do it in RAG applications. As the next step, we have our three LLM as judge evaluators: Claude 4.5 Sonnet, Claude 4.5 Haiku, and Amazon Nova Pro as judges. We pass to each of those three judges five different inputs: the question and the context from the test dataset that Salah generated, the answer from the fine-tuned model from the previous step that I just showed, and the answer from the base model based on the question and context.

Thumbnail 1960

We also pass the ground truth answer from the test dataset. We ask each of those three models to evaluate the results generated from the fine-tuned model and the base model based on three evaluation criteria. Number one is accuracy, which measures how well the result from the fine-tuned model and the base model agrees with the ground truth answer from the test dataset. Number two is completeness, which measures how well the answer addresses all the different aspects of the question. Number three is relevance, which is, in my opinion, the most important metric in RAG applications because it measures how well the answer follows the context given to it, because this is how RAG works.
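Conceptually, each judge call can look like the sketch below: one Bedrock Converse request per judge per test example, scoring both candidates on the three criteria. The prompt wording and the JSON schema are illustrative assumptions.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question, its context, a ground-truth answer, and two "
    "candidate answers (A and B), score each candidate from 1-10 on accuracy, completeness, and "
    'relevance to the context. Return JSON: {"A": {"accuracy": 0, "completeness": 0, "relevance": 0}, "B": {...}}'
)

def judge(question, context, ground_truth, finetuned_answer, base_answer, judge_model_id):
    """Ask one judge model (e.g. Claude 4.5 or Nova Pro) to score fine-tuned vs. base answers."""
    user_text = (
        f"Question: {question}\nContext: {context}\nGround truth: {ground_truth}\n"
        f"Candidate A (fine-tuned): {finetuned_answer}\nCandidate B (base): {base_answer}"
    )
    response = bedrock.converse(
        modelId=judge_model_id,
        system=[{"text": JUDGE_PROMPT}],
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},  # deterministic scoring
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```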

Thumbnail 2010

Thumbnail 2030

Thumbnail 2050

Thumbnail 2060

We asked the three judges to evaluate the performance of the base model and the fine-tuned model. Here we have the results of our LLM as a judge evaluation. In my opinion, these are really great results because they show that the different evaluator models agree that the fine-tuned model outperforms the base model on the RAG task for all the test data, aggregated across all three evaluation criteria. For example, Claude 4.5 Haiku says that the fine-tuned model outperforms the base model by 17 percent. Claude 4.5 Sonnet says that it outperforms by 15 percent, and Nova Pro says it outperforms the base model by 10 percent. On average, we get 14 percentage points of improvement from adding the fine-tuned model to the RAG part of our Agentic AI application.

Implementing the Agentic AI Application: RAG and Telemetry Tools

Let me quickly recap before jumping to the next step. We started by deploying the infrastructure to our EC2 machines in order to replicate or mimic the environment of AWS Outposts to have our Agentic AI application. Then Salah thankfully showed us how to prepare the data and fine-tune a small language model for the RAG task, and I just showed you how important this is in our architecture.

Thumbnail 2130

Now we need to put this together into an Agentic AI application that runs at the edge on AWS Outposts. This Agentic AI application will have access to the RAG tool and also the telemetry data tool. Let's start doing that. As you can see, our two EC2 machines are up and running now: the factory agent instance and the fine-tuned small language model instance. I would like to log in to the factory agent instance, so I'll use Kiro for that. I installed a plug-in for remote SSH. I connect to the host, and now it's opening an SSH session inside the EC2 machine which we instantiated at the beginning.

Thumbnail 2170

Thumbnail 2190

Thumbnail 2210

Thumbnail 2230

Thumbnail 2260

So now we are in the EC2 machine. I need to go inside the factory agent directory which I copied in my user data script. I'll zoom in a little bit. The zoom isn't working. During this time, let me explain something very important. We used two different approaches here: a fine-tuned model and, at the same time, RAG. Does anyone have an idea why we did both? Because most of the time you read about only one of them. Any answers? Someone said it's about ensuring the right information from the right database. Exactly. Let me dive deeper into this. The most important part here is to have knowledge about your documents. By fine-tuning the model, it gains the skills needed and you set the tone of your agent so it responds correctly. By having RAG, you retrieve up-to-date information from your knowledge base. That's why we combined both to achieve the LLM as a judge evaluation results. We are talking about a roughly 15 percent increase over the base model.

Thumbnail 2310

Thumbnail 2320

Thumbnail 2340

So now, as Salah was talking about the RAG tool, I would like to start by showing you how to develop the RAG tool that I will be using for my agent application. I pre-created this class, which is called RagRetriever. I also previously pre-ingested the documents into Chroma DB as a vector store in my environment, and I created a very simple search method that you can provide with a query and the number of results. It uses this query to look up information inside the vector store and retrieves the most relevant pieces from within your vector store. I used the tool decorator from the Strands SDK in order to expose this search documents method. Here I'm providing a docstring, which will be used by the agent to understand what this tool does and how it can actually be used. Then we iterate over the results of the retrieval from the vector store and print them in a nice format. So this is a tool for RAG, Retrieval Augmented Generation.
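A stripped-down sketch of that retrieval-only tool is shown below, assuming the documents were already ingested into a local Chroma collection; the collection name and path are assumptions.

```python
import chromadb
from strands import tool

# Assumes the machine manuals and SOPs were already ingested into this local collection.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("factory_docs")

@tool
def search_documents(query: str, num_results: int = 3) -> str:
    """Search the machine manuals and standard operating procedures for information
    relevant to the query and return the most relevant excerpts."""
    results = collection.query(query_texts=[query], n_results=num_results)
    chunks = results["documents"][0]
    return "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
```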

Thumbnail 2360

Thumbnail 2400

Do you think anything is missing in this implementation? Again, we have retrieval augmented generation. Do you see anything missing in this tool implementation, a major thing? Just to save time: the G is missing, the generation. If you remember the model that Salah spent some time fine-tuning, we're not using that model here, right? What I would like to show you is how to use that model in order to enhance the results, like I showed in the LLM as a judge evaluation. So what I will do is instantiate an Ollama client using the IP address and port of the machine which has the fine-tuned model.

Thumbnail 2420

Then I need to copy and paste a couple of code parts. I changed this to response. Now we have our RAG tool complete: we have the retrieval part and we have the generation part, asking Ollama to invoke the endpoint of our fine-tuned model and passing it the query and the retrieved results. This was the first tool, and this tool retrieves the standard operating procedures and the manuals of the different machines inside your production line.
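The added generation step could look roughly like this, using the Ollama Python client against the second EC2 machine; the IP address and the factory-llama model name are assumptions.

```python
from ollama import Client

# The second EC2 machine serves the fine-tuned Llama 3.2 3B through Ollama on port 11434.
finetuned_llm = Client(host="http://10.0.1.25:11434")

def generate_answer(query: str, retrieved_context: str) -> str:
    """The 'G' in RAG: ask the fine-tuned model to answer from the retrieved context only."""
    response = finetuned_llm.chat(
        model="factory-llama",  # the model name registered with `ollama create`
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{retrieved_context}\n\nQuestion: {query}",
        }],
    )
    return response["message"]["content"]
```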

Thumbnail 2470

Thumbnail 2480

Now the second tool is the telemetry tool, and this telemetry tool retrieves real-time telemetry data and alarms from the machines inside your production line. I've mocked this real-time data with a very simple CSV file, as you can see. Let me zoom in a little bit. As you can see, the CSV file has a device name, the sensor type of that specific device, the reading, and the time stamp for that reading. We have two devices, if you remember from my first slide: the freezer tunnel machine and the cookie former machine. For each of those two devices we have two different sensors: temperature and speed. Then we have the reading for those different sensors, and finally the time stamp of that specific reading.

Thumbnail 2520

Thumbnail 2530

So for the telemetry tool, I also pre-created a class, the TelemetryReader, that reads this CSV file. I also created another helper method which filters the CSV down to a given time frame. We pass this method three things: the device name that we want to look for, the current time where we are standing, and the number of minutes back in time that we want to look at. For example, if we want to look at the telemetry data for the last five minutes, we set minutes_back to five.

Thumbnail 2550

Thumbnail 2570

If we provide as input the device name corresponding to the freezer tunnel, this is what the output looks like: the freezer tunnel has temperature and speed sensors, and those are the readings for the different time stamps. Now I want to create a Strands tool out of that. So let me create it as we go. If you remember, I use the tool decorator. I create a method called get_telemetry_data, and for that method I need a device name, which is of type string, and minutes_back, which is of type integer. The return type here should be a formatted string.

Thumbnail 2620

Thumbnail 2640

Thumbnail 2650

Thumbnail 2660

Thumbnail 2680

Thumbnail 2690

Again, for any tool we need a well-structured docstring, so I will copy the docstring from my notes here. Then we need to code the tool. This method instantiates the TelemetryReader class, so we need that. To call the method which I showed you before, we need the device name, which is passed as an input, the minutes_back, which is also passed as an input, and the current time string. Luckily, Strands has its own implementation of current time, so we will use that one: current_time_str equals the current_time tool from Strands. Finally, we need to call the method that I just showed you: data frame equals tel_reader.get_device_sensors_in_timeframe. We pass the device name that we have as an input, we pass the current time string, and we pass minutes_back.

Thumbnail 2730

Thumbnail 2740

Now we just need to formulate our results in a nice f-string. Readings for device, device name, for the last minutes back. Then we just need to concatenate the results that we get from the methods. Result plus equal JSON dumps, and then we have the data frame and we set the indentation to two. Finally, we return the results.
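Put together, the tool narrated above looks roughly like the sketch below. TelemetryReader and its get_device_sensors_in_timeframe method are the helper class from the session (its constructor argument and JSON-serializable return value are assumptions here), and plain datetime stands in for Strands' built-in current_time tool.

```python
import json
from datetime import datetime
from strands import tool

@tool
def get_telemetry_data(device_name: str, minutes_back: int) -> str:
    """Retrieve real-time sensor readings (temperature, speed) for a device
    over the last `minutes_back` minutes, formatted as JSON."""
    tel_reader = TelemetryReader("telemetry.csv")  # helper class prepared before the session
    current_time_str = datetime.now().isoformat()  # the session uses Strands' current_time tool instead
    frame = tel_reader.get_device_sensors_in_timeframe(device_name, current_time_str, minutes_back)
    result = f"Readings for device {device_name} for the last {minutes_back} minutes:\n"
    result += json.dumps(frame, indent=2)  # assumes the helper returns a JSON-serializable structure
    return result
```

Calling get_telemetry_data("freezer_tunnel", 5) would then reproduce the JSON output shown in the next test.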

Thumbnail 2770

Thumbnail 2780

Thumbnail 2800

Thumbnail 2810

So, as any good software engineer, I wrote a tool and now I just need to test it. Let me do that very quickly so that the application works at the end. Get telemetry data: we need one device from our CSV file, which is, for example, the freezer tunnel, for the last five minutes. I always make this mistake; it should only return the result. Now it works. So now you can see that we have our readings for the freezer tunnel for the last five minutes, structured in a nice JSON format.

Thumbnail 2830

Thumbnail 2870

The final thing that I would like to add here is a very simple tool inside this telemetry tool: a tool that can be used as a lookup for the agent to understand which devices it has access to. Think about the situation here. For example, I type freezer_tunnel, which is the name that I just copied and pasted as is from the CSV file. In a real-life scenario, users don't know exactly how the name is written inside your database, so we need to have some sort of lookup so that if, for example, the user makes a typo, leaves out the underscore, uses different capitalization, and so on, this will still work. So let me just copy and paste the last tool, list available devices. And that should be it.
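That lookup tool can be as small as this; the CSV path and the device_name column header are assumptions matching the mock file described earlier.

```python
import csv
from strands import tool

@tool
def list_available_devices() -> str:
    """List the exact device names present in the telemetry data, so the agent can map
    loosely written user input (typos, missing underscores, different casing) to real devices."""
    with open("telemetry.csv", newline="") as f:  # same mock CSV used by the telemetry tool
        devices = sorted({row["device_name"] for row in csv.DictReader(f)})
    return "Available devices: " + ", ".join(devices)
```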

Thumbnail 2890

Thumbnail 2900

Live Demo: Debugging Factory Issues with AI Agents at the Edge

Now the last fun part is the agent, which puts all those things together. I pre-created this class, which is called FactoryAgent. It uses the GPT OSS model, the edge model that I deployed at the beginning of the session, and it uses this big system prompt. I would like to walk you through that prompt and how I structured it because it might be relevant to you. The first thing that I put in the prompt is some orientation about what the tools look like and which tools the agent has access to. The second thing is some possible scenarios and examples of user interactions, and based on those user interactions, which tools the agent should be using, one after the other. So it's some sort of few-shot prompting.

Thumbnail 2940

Thumbnail 2970

Then we have the system prompt, we have the model, and we have the tools I imported from the classes. We instantiate the Strands agent and we are good to go. Finally, we need to test our agent and see what the result looks like. Before the session, I asked my best friend Karim to create a nice UI that I will be using to test my agent, so I will just launch this UI on top of the agent. Yes, and here we have our UI.
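Wiring it together with the Strands SDK could look like the sketch below. The condensed system prompt, the Ollama host, and the gpt-oss:20b tag are assumptions; the tool names refer to the sketches above.

```python
from strands import Agent
from strands.models.ollama import OllamaModel

# Condensed stand-in for the session's much larger system prompt: tool orientation plus
# a few-shot example of which tools to chain for a typical debugging request.
SYSTEM_PROMPT = (
    "You are a factory assistant. Tools: search_documents (manuals/SOPs), "
    "get_telemetry_data (live sensor readings), list_available_devices (exact device names). "
    "Example: for 'debug freezer', list the devices, pull recent telemetry for the freezer tunnel, "
    "look up its normal operating ranges in the manuals, then explain the deviation and next steps."
)

# The GPT OSS routing model served by the local Ollama container on the agent machine.
router_model = OllamaModel(host="http://localhost:11434", model_id="gpt-oss:20b")

factory_agent = Agent(
    model=router_model,
    system_prompt=SYSTEM_PROMPT,
    tools=[search_documents, get_telemetry_data, list_available_devices],
)

# The same "junior operator" query used in the demo below.
factory_agent("debug freezer")
```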

So now, since we have our agent, I would like you to put yourself in the shoes of an operator inside the cookie factory. Remember my first slide, when the visual inspector detected a problem with a cookie? Now you are a junior operator inside this cookie factory and you want to understand what the problem is. You know that the visual quality inspection has two devices in front of it, which are the freezer tunnel and the cookie former machine.

Thumbnail 3020

Thumbnail 3030

As a junior operator, what I would do is ask a very simple question: debug freezer. That's a very simple but powerful enough prompt for the agent to understand, or try to reason about, what is wrong with this freezer tunnel. As you can see, it used the list devices tool because I didn't use the 100% correct name of the device inside my CSV file. It used the get telemetry data tool that I just coded, and it used the search documents tool that we also wrote for retrieval.

Thumbnail 3050

Thumbnail 3060

Thumbnail 3080

Let's see why. As you can see in the results, and I will make it bigger in a second, what it did is retrieve the telemetry data and compare it with the correct values from the documentation to see whether those measurements make sense or not. In our case, it detected that the temperature is not in the correct range, and it also suggested some follow-up actions to debug this machine and restore it to its operating state.

Thumbnail 3090

This is what happens when your factory starts listening: everyone and everything inside your factory now speaks the same language. Machines, IoT sensors, and systems are fully integrated together, and you have an AI system that gives you different types of recommendations, explanations, and insights about your factory. The important part here, and this is a key takeaway, is that we connected the operations with a full deployment on an EKS local cluster on AWS Outposts. In case of any disconnection, you still have access to your control plane, you still have access to your data plane, and you can manage your infrastructure completely during any disruption.

Second is deploying a small language model at the edge, which makes this kind of real-time intelligence possible. You understand exactly what you are deploying, the agents derive the insights, and they give recommendations on the fly. Third, and this is a very important part, is that you are still generating more data every day, and you have to keep learning and understanding continuously, because the data is your gold. Once you get this data, update your model: trigger the pipeline again and fine-tune the model.

Thumbnail 3210

Thumbnail 3260

I encourage you to scan those three QR codes to do the full end-to-end deployment of agentic AI and a small language model at the edge. Now it's your turn to start with one single process. Get some data and start to understand your manufacturing production line in a more detailed way. Start small, fail fast, understand your data, structure the data in a correct and proper way, and then fine-tune the model so it speaks your language and generates the correct information. Then do the deployment at the edge on AWS Outposts.

Thumbnail 3290

Thank you, and it was really a pleasure having you as part of this session. I encourage you to take the survey and give us your feedback. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
