🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Centralize Multicloud Management using AWS (COP342)
In this video, AWS Principal Technologist Clark Richey and Senior Solutions Architect Erik Weber demonstrate practical solutions for managing multi-cloud environments. They showcase AWS Systems Manager for centralized node management across AWS, Azure, GCP, and on-premises systems, including the automated hybrid activation flow Rackspace uses to manage 100,000+ VMs. Erik provides live demos of patch compliance monitoring, secure SSH/RDP access without internet exposure, and automated script execution at scale. The session covers CloudWatch agent bootstrapping using State Manager associations, centralized log aggregation across accounts at no additional charge, and Azure Monitor integration. A Phillips 66 case study shows up to a 30% reduction in mean time to recovery using Amazon Managed Grafana and Amazon Managed Service for Prometheus. Key features include Session Manager for secure connections, automation runbooks for orchestrating tasks, and unified observability dashboards exposing operational metrics to stakeholders without AWS console access.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Multi-Cloud Challenge and Session Overview
OK, good morning, everybody. Thank you for joining us today. I want to start off with a quick bit of interactivity to get a sense of who's in the room. How many of you are actively developing in a multi-cloud environment at this point? Good, good. OK, keep your hands up for a second. Those of you who are actively in a multi-cloud environment, how many of you just love all the fun new tools and things you have to deal with? Yeah, very few, right? Good.
Well, that's awesome. I'm Clark Richey. I'm a Principal Technologist here at AWS, and I'm joined by Erik Weber, who's a Senior Solutions Architect at AWS. And this is exactly the problem that we're here to help you with today, all that giant amount of tools and stress you have to deal with in a multi-cloud scenario.
So we'll talk today a little bit about the goal overall for the session and the challenges that you're all keenly aware of in multi-cloud. And then we'll get into the meat of it. Erik's going to talk to you about centralized node management, observability, and we're going to give you some real customer stories on how customers are actually doing this in the real world, and of course some resources on where to go for next steps.
And really again, this is about all of you. So all of you who quickly put your hands down when I asked if this was really fun to have to deal with all the complexity and the new things you have to learn and the different tool sets, this is to help you make your jobs and your life easier. We're actually going to give you tools and techniques that are actionable and practical right now. And we're going to give you examples of that. Erik's going to give you real live demos showing you how it can be done very simply, and we'll give you some customer stories that show how customers have achieved actual return on investment and gotten time back from doing exactly these things.
So just to level set, let's define multi-cloud. I'm sure if I asked all of you, we'd get 100 or so different answers, and that's totally fine. For the purposes of this talk, at AWS we define being multi-cloud as running your applications or workloads across more than one cloud. Maybe you run one in AWS and another somewhere else. That's what we define as multi-cloud. If you're just using one primary cloud provider and, say, a SaaS service like Office 365, that's great, no problem there, but that's not what we're really talking about when we talk about multi-cloud here at AWS.
And so of course, as we just talked about, we're all racing into multi-cloud. Many of us do it very intentionally to drive business outcomes we're looking to achieve. And sometimes multi-cloud gets thrust upon us, right? We have mergers and acquisitions, and the next thing you know you're in more clouds than you ever thought possible. That of course requires new skill sets, new training, new tools, new runbooks, new processes, and we know that it can be a lot. So today we hope to give you some techniques and tips to help you manage that and make your life simpler. With that, I'm going to turn it over to Erik.
Centralized Node Management Challenges and AWS Systems Manager
Awesome, thank you. So recapping what Clark mentioned: today we end up working with different tools depending on the environment we're in. Like we said at the beginning, we're going to talk first about centralized node management, which means getting an understanding of where our resources are and how we can operate on them after they're provisioned.
So some of the challenges that we hear in this multi-cloud space for node management is, of course, just getting an understanding of what is your current inventory. Where are you running servers and VMs? Are they EC2 on AWS? Are they VMs over in Azure or GCP? Are they even in an on-premises environment or even at the edge? So getting that foundational piece of where our resources are running is critical.
Once we have that established, then we need to start operating them. We'll talk about observability in a moment, but another challenge when you're operating these long-standing (or even short-lived) resources is patch compliance, especially when we're talking about regulatory requirements or internal security requirements you may need to adhere to. Ideally, we use a centralized tool for patch scans so we know exactly which resources are noncompliant, how long they've been noncompliant, and what the upcoming patch schedule or frequency is for remediation. That helps alleviate some of those concerns.
Another common challenge is direct RDP or SSH from the internet. We want to improve our overall security footprint, do away with direct connections to these resources, and remove ports 22 and 3389 from those inbound security group rules. Systems Manager, which we'll talk about, helps you accomplish that.
And the last thing is improving our overall automation rather than performing every single thing manually. Instead of making a direct connection, logging into a node, and running commands by hand, let's talk about how we can perform these tasks at scale to help speed up our DevOps processes.
So what can we use to do that? AWS Systems Manager is the service we can use to manage resources wherever they are. It's based on the Systems Manager agent, or SSM agent as you'll hear it called. We can install it on AWS resources, on-premises, at the edge, or even on other cloud providers, and a wide variety of our AMIs have it pre-baked in. Once you have that agent in place, you can perform all of these node management tasks, including patching and gathering inventory.
Building Centralized Inventory with Hybrid Activation Codes and Automation
All right, let's go through each of those four challenges I highlighted at the beginning. We'll talk about how to get started, how to automate some of these tasks, and then we'll jump into the console. It starts with building out that foundation of a centralized inventory: knowing what we need to operate and where those resources reside. Like I was just saying, it begins with the Systems Manager agent. On the left-hand side, we have our other cloud provider or on-premises environment, wherever this server or VM resides, and it starts with installing the Systems Manager agent. As part of that process, you create what we call a hybrid activation code. Think of it as very similar to an IAM access key and secret key; we pass it as part of our installation to determine which account we want to register with, which region, and which IAM role grants permissions to that agent.
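For reference, here is a minimal sketch of what that manual install-and-register step can look like on an Ubuntu VM. The region, activation code, and activation ID are placeholders you would substitute from your own activation, and the snap-based install path is specific to Ubuntu; other distros use .deb or .rpm packages instead.

```bash
#!/usr/bin/env bash
# Minimal sketch: manually register an Ubuntu VM (on-premises or another cloud)
# with AWS Systems Manager using a hybrid activation code.
# REGION, ACTIVATION_CODE, and ACTIVATION_ID are placeholders.
set -euo pipefail

REGION="us-east-1"
ACTIVATION_CODE="<activation-code>"
ACTIVATION_ID="<activation-id>"

# Install the SSM agent (snap packaging on recent Ubuntu; other distros differ).
sudo snap install amazon-ssm-agent --classic

# Stop the agent, register it against the account/region from the activation,
# then start it again. The node then appears as a managed (mi-*) node.
sudo snap stop amazon-ssm-agent
sudo /snap/amazon-ssm-agent/current/amazon-ssm-agent -register \
  -code "$ACTIVATION_CODE" -id "$ACTIVATION_ID" -region "$REGION"
sudo snap start amazon-ssm-agent
```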
If you have a connection between these two environments, you can connect privately rather than going over the public internet: establish a connection to the Systems Manager VPC endpoint, and this begins the registration process. Behind each of these SSM agents and hybrid activation codes is an IAM role that can be used by the SSM agent as well as other applications and resources on that same VM. It generates temporary STS tokens, so these are ephemeral; they're constantly rotated and handled by the SSM agent, and they're made available to the operating system.
So if you have other applications that need to access AWS resources, maybe a web app that needs to reach an S3 bucket, DynamoDB, or RDS, you can grant that same role permissions and allow those applications to use the same temporary credentials that the SSM agent is vending. You can move away from long-standing hardcoded keys. So that was the manual process: manually installing the SSM agent and creating that code. Let's talk about how we can automate this a little bit more.
So it starts with the servers or VMs. We'll take a look at an example script when I flip over to the console, but essentially, whenever a new node is spun up, we bake in some user data, or just an initial startup script, that queries an API Gateway endpoint. Behind that API Gateway is a Lambda function. Lambda performs a quick evaluation: has this process been requested yet? If not, it proceeds and queries Systems Manager Parameter Store. If you're not familiar with Parameter Store, the name is pretty straightforward: you can store keys and values there and then access them from your resources.
Now, the reason all of this is required is getting that hybrid activation code to the node. You could also store it in an external location, such as an on-premises key vault, but if we're trying to access AWS resources like Parameter Store, we need AWS credentials. That's why API Gateway is the intermediary that can serve the request before the node has any credentials in the first place. Once Lambda verifies that the code is ready, it delivers it back to the server or VM, and the node can go ahead and register.
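As a rough illustration of that flow, here is what such a bootstrapping script could look like as user data. The API Gateway URL and the JSON field names returned by the Lambda are assumptions for this example, not taken from the session, and jq is assumed to be available on the image.

```bash
#!/usr/bin/env bash
# Illustrative bootstrapping sketch: fetch a hybrid activation from a
# (hypothetical) API Gateway endpoint, then register the SSM agent with it.
set -euo pipefail

API_ENDPOINT="https://example.execute-api.us-east-1.amazonaws.com/prod/activation"

# Ask the Lambda behind API Gateway for a current activation.
RESPONSE="$(curl -sf "$API_ENDPOINT")"
ACTIVATION_CODE="$(echo "$RESPONSE" | jq -r '.ActivationCode')"
ACTIVATION_ID="$(echo "$RESPONSE" | jq -r '.ActivationId')"
REGION="$(echo "$RESPONSE" | jq -r '.Region')"

# Register exactly as in the manual flow; the agent then starts vending
# temporary credentials for the IAM role attached to the activation.
sudo snap stop amazon-ssm-agent || true
sudo /snap/amazon-ssm-agent/current/amazon-ssm-agent -register \
  -code "$ACTIVATION_CODE" -id "$ACTIVATION_ID" -region "$REGION"
sudo snap start amazon-ssm-agent
```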
And this is exactly how Rackspace is doing it. On the left-hand side you can see GCP, Azure, and VMware running these bootstrapping scripts, which query an API Gateway endpoint. Their setup differs a little behind that, but it leverages Lambda to return the hybrid activation code so that all of these newly spun-up VMs can register with Systems Manager. By doing this, Rackspace is able to automate tasks across 100,000-plus VMs on a routine basis, reducing the amount of manual effort required.
Live Demo: Systems Manager Unified Console and Azure VM Registration
When you have systems with the AWS Systems Manager agent installed, you can bootstrap those resources and keep them in a maintained configuration state. You can have all of that patching, the session management that we talked about, as well as just general inventory data and the ability to run scripts on them as needed. So let's go ahead and flip over into the console now.
All right, just to get started, I wanted to begin on the Systems Manager landing page. This is something we launched last year called the Systems Manager unified console. From a delegated administrator account, it lets you aggregate data about all of these managed nodes into a single location. Starting on the initial dashboard, we get some quick information, and this node summary highlights how many of our managed nodes are currently registered with Systems Manager. If there are any unmanaged EC2 instances, it highlights those as well and gives you some options to remediate that.
In the middle, we can see an actual breakdown. In this simple environment, we're running 11 EC2 instances and about 18 hybrid nodes. Again, hybrid can mean on-premises, another cloud, anywhere outside of AWS really. Following that, we have some initial information about which Systems Manager agents are out there, and the last two panels focus on the operating systems: a quick breakdown of Windows versus Linux, and the actual operating system running on each of those.
So from this single location, we can then go to explore our nodes and we can see the accounts and regions as well as the organization units. We have this centralized piece of information so that we can then start working with our nodes and operating on them. Now, if we have a new VM that we want to spin up, like I mentioned, we can include that bootstrapping script within our VM image, or ideally, you can also pre-install the Systems Manager agent.
Just to kind of jump through this real quick, I'm over into the Azure console now. I'm going to go ahead and spin up an Azure VM, and we'll take a look at how it is able to then just automatically register through that process that we were looking at earlier. So we're not going to be doing anything too special. We're just going to go with a lot of the basics here. We're just going to launch a new Ubuntu node. This is then just going to leverage a key that I already have, and then just a few more changes.
All right, just specifying some of the networking information. Now, in this environment, there is that connection between my Azure virtual network as well as with the AWS VPCs. So it is communicating privately. It doesn't have to go over the public internet. Now, this one is still assigning a public IP, but we're just going to go ahead and remove that entry point for port 22.
Now, the last thing is what I was mentioning earlier about the bootstrapping script. The key piece in here is that API Gateway endpoint. Again, we need that endpoint so the node can retrieve an activation code from outside of AWS, before it has any credentials of its own; we'll take a look at an example code in a moment. All right, we'll go ahead and launch this, and it will probably take about two to four minutes for the resource to register.
So while we are waiting for that, I just want to make sure it hits the deployment. There we go. So let's take a look at that hybrid activation code that I've been talking about. So over in the activations section of the console, this is where I have a few different codes that have been created, but just to kind of talk about this a little bit more. Again, this is a code that's required for that resource to know which account and which region it should register with.
If you're creating it from the CLI or our SDKs, you can also add tags that are automatically applied, so you can tag these hybrid nodes just the same as EC2 instances. And the really important part that I was highlighting earlier is this IAM role section. In the next section we're going to talk about observability, so how do I get observability through the CloudWatch agent? How do I get credentials onto that node? Well, I can attach CloudWatch permissions to the role that is associated with this code, and then the CloudWatch agent can pick up those credentials and put logs and metrics.
So if I just create this activation code, we then just see that we have this code as well as an ID, which we would then normally include within our bootstrapping script. But these codes have an expiration date of a maximum of 30 days. So that's why having some of this automation behind an API Gateway with Lambda can help speed up that process so you are not constantly updating your existing scripts. So if we go back to the explore nodes section, let's see if it's registered now. Yeah, so we now have 30 managed resources rather than 29.
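For reference, creating an activation like this from the CLI might look roughly as follows. The role name, tag values, and instance name are hypothetical; the role must already exist and trust ssm.amazonaws.com, and attaching the CloudWatchAgentServerPolicy is what lets the CloudWatch agent reuse the vended credentials later in the observability section.

```bash
# Sketch: create a hybrid activation with tags and an IAM role from the CLI.
# "SSMHybridRole" and the tag values are placeholders for this example.
aws ssm create-activation \
  --default-instance-name "azure-fleet" \
  --description "Activation for Azure VMs" \
  --iam-role "SSMHybridRole" \
  --registration-limit 100 \
  --expiration-date "$(date -u -d '+30 days' +%Y-%m-%dT%H:%M:%SZ)" \
  --tags Key=provider,Value=azure \
  --region us-east-1
# Returns ActivationId and ActivationCode for the bootstrapping script.

# Grant the same role CloudWatch permissions so the CloudWatch agent can reuse
# the temporary credentials the SSM agent vends (covered in the next section).
aws iam attach-role-policy \
  --role-name "SSMHybridRole" \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```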
If we jump back to Azure, let's just check the IP address for this new one. It's 0.07. If we take a look at our resources here, there's our new Azure VM4 resource that we just spun up. Now it's going to go ahead and perform a few different evaluations within Systems Manager. But now that we have it as a registered device, we can go ahead and just start running our different tools on it.
There's a wide variety of tools that are available out of the box that you can use to grant access to your day-to-day operators that may need to interact with these resources. The huge benefit here is that we don't have to allow actual OS level access. They can run these commands from the AWS console, from the AWS CLI, from our APIs and SDKs, and interact with it without having to fully connect to the operating system itself.
Let's go ahead and take a look at just some performance counters. If we want to just get some on-demand metrics, we can get a quick view at the CPU utilization, maybe disk input output, memory usage, or even network traffic, just getting a quick spot check on how this node is doing after being spun up just a few minutes ago. Let me actually just pull up one of our other resources that I deployed earlier as well.
If we actually check its processes, when I select the processes tool, it's going to establish a connection with the SSM agent. If you're not aware, the SSM agent works over outbound port 443; everything is done through a reverse connection to the Systems Manager endpoints. And again, if you have those VPC endpoints deployed, the node doesn't even need access to the internet at all, and you can still remotely connect to it and perform these tasks.
This is going to start up a session against this hybrid resource, and now we can get a quick look at the live processes running on the node. Just before the session I spun up stress-ng to peg the server. For any of these processes, I can terminate it right from here, so let's stop that stress test and the node can go back to its typical 1 or 2% CPU utilization. This is a great way to interact with these resources without having to fully connect remotely.
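The same spot-check can also be done without an interactive session at all, via Run Command. Here is a hedged sketch: the node ID is a placeholder (hybrid nodes use mi-* IDs), and the commands themselves are just examples.

```bash
# Sketch: inspect and stop a runaway process via Run Command instead of the
# console processes tool. The target ID below is a placeholder.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=InstanceIds,Values=mi-0123456789abcdef0" \
  --parameters 'commands=["ps aux --sort=-%cpu | head -n 5","pkill -f stress-ng || true"]' \
  --comment "Spot-check CPU hogs and stop the stress test"
# Feed the returned CommandId to `aws ssm get-command-invocation` to read the
# output for a given node.
```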
Patch Compliance Management and Secure Remote Access with Session Manager
Just to recap, if we go back to node insights, this is a great way to start establishing that centralized inventory and have these tools available to us. All right, let's go back to the slides and skip through a few of these. We just talked about establishing that centralized inventory; the next step is how we can overcome some of those struggles with patch compliance.
Let's flip back into the console, this time over to the Patch Manager section, where we can get a very quick glimpse of how we're doing at the moment. This is very similar to that first page, showing how many resources are registered with Systems Manager versus unmanaged. It looks like in this case about 10 of those nodes are noncompliant with patching. We can always see how many nodes are actually missing updates, how many failed to perform an install, whether they're pending any reboots, and how long a resource has either been noncompliant or never reported a patch compliance state in the first place.
That brand new VM we just launched shows up with an unknown patch compliance state. In this environment I think I have daily scans configured at 2 in the morning, so that won't flip for a few more hours. But we can see all of our other resources have relatively fresh data; patch compliance information has been generated within the last 7 days, and we can see all of the actions that have been taken against each resource. If we want to go ahead and scan that new Ubuntu node, we can just select Patch Now. This gives a few different options, but we're going to go with the basics for a quick scan and choose our node manually.
We'll select one of those Ubuntu resources, and when we hit Patch Now, this kicks off a patch scan. It'll probably take 30 seconds to 1 minute, so while we wait for that, let's talk one more time about some of those challenges.
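For reference, the CLI equivalent of that Patch Now scan looks roughly like this, using the AWS-RunPatchBaseline document in Scan mode; the node ID is a placeholder.

```bash
# Sketch: on-demand patch compliance scan against one node (placeholder ID).
aws ssm send-command \
  --document-name "AWS-RunPatchBaseline" \
  --targets "Key=InstanceIds,Values=mi-0123456789abcdef0" \
  --parameters 'Operation=Scan' \
  --comment "On-demand patch compliance scan"

# Once the scan finishes, read back compliance results for that node:
aws ssm list-compliance-items \
  --resource-ids "mi-0123456789abcdef0" \
  --resource-types "ManagedInstance" \
  --filters "Key=ComplianceType,Values=Patch,Type=EQUAL"
```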
We just took a very brief look at capturing patch compliance data, and you can always aggregate this information as well. Systems Manager has what we call a resource data sync to ship this data to an S3 bucket, and there you have a centralized data source for what is your patch compliance state across AWS, Azure, or wherever that resource is residing.
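A resource data sync of that kind can be set up with a single CLI call; here is a hedged sketch, with the sync name and bucket as placeholders (the bucket policy has to allow Systems Manager to write to it).

```bash
# Sketch: ship inventory and patch compliance data to a central S3 bucket.
aws ssm create-resource-data-sync \
  --sync-name "org-patch-compliance" \
  --s3-destination "BucketName=my-central-compliance-bucket,SyncFormat=JsonSerDe,Region=us-east-1"
# From there, the data can be queried with Athena or visualized in QuickSight
# for fleet-wide patch reporting.
```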
Moving on to our next challenge, let's talk about what to do when those tools or remote operations aren't sufficient. In an incident or an on-call situation, you may need to actually connect remotely. So how can we do that in a secure way rather than allowing inbound access? If we go back to exploring our nodes, there it is: that is the new VM we just provisioned.
Like I was mentioning earlier when we talked about EC2 instances, we can just remotely connect to them. We don't have to allow inbound port 22 or 3389; we can use Systems Manager as essentially a network tunnel to connect and start up these interactive sessions. I'm starting this from the AWS console, but you can also start these sessions from the AWS CLI. However your users are connecting today, you may be able to use Session Manager to improve your overall security footprint from a networking perspective, as well as to define, through standardized IAM policies and roles, which nodes someone can connect to and which operating system user they connect as.
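From the CLI, starting a session looks roughly like this; it requires the Session Manager plugin installed locally, and the node IDs below are placeholders.

```bash
# Sketch: interactive shell session over Session Manager, no inbound ports open.
aws ssm start-session --target mi-0123456789abcdef0

# Port forwarding works the same way, e.g. tunneling RDP to a Windows node
# without exposing 3389 to the internet:
aws ssm start-session \
  --target mi-0fedcba9876543210 \
  --document-name "AWS-StartPortForwardingSession" \
  --parameters '{"portNumber":["3389"],"localPortNumber":["13389"]}'
```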
So now that I'm connected, I can just start running my commands as needed. Let's just say hello, re:Invent. The other main thing about Session Manager is that the commands I'm issuing can also be logged. I can ship them either to CloudWatch Logs or to an S3 bucket, so after these sessions are terminated, I can always go back and review what commands were issued during that time.
If we jump over into the CloudWatch console, we can go to the log management section. In here we have our CloudWatch log group called Session Manager, with all of the streams from the sessions we've been establishing. Let me zoom in a little bit; that's better. In here we can see the whoami command that I issued, the date, and if we go all the way to the bottom, hello, re:Invent. This is a great way to have that information available if you need to go back and review what actions an operator took during a given session. You can also build alarms around it; if users enter specific commands, you can kick off alarms against that.
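One way to wire that up, as a hedged sketch, is a metric filter on the session log group plus an alarm. The log group name, filter pattern, and SNS topic ARN are placeholders.

```bash
# Sketch: alarm when a sensitive command appears in the Session Manager logs.
aws logs put-metric-filter \
  --log-group-name "SessionManager" \
  --filter-name "risky-commands" \
  --filter-pattern '"rm -rf"' \
  --metric-transformations \
    metricName=RiskyCommandCount,metricNamespace=SessionAudit,metricValue=1

aws cloudwatch put-metric-alarm \
  --alarm-name "session-risky-command" \
  --namespace SessionAudit --metric-name RiskyCommandCount \
  --statistic Sum --period 300 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts
```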
Let's check back on our patch process: it went ahead and performed that patch scan against the resource. If we go back over to Patch Manager and into compliance reporting, let's see how many updates are actually missing. We have a variety of resources, both an EC2 instance here as well as that hybrid resource, that are noncompliant. We can drill in further to see the specific updates being marked as missing and give our applications or app owners a heads-up: here are the updates that are going to be installed.
Recapping those challenges again: we just talked about creating an interactive session without any access over the internet. Again, with those VPC endpoints deployed and a connection between your other cloud and your AWS VPC, it can be fully private, and the node doesn't need internet access at all.
Performing Actions at Scale with Automation Runbooks and Run Command
Now, that last challenge is being able to perform actions at scale. Of course, logging in like I just did and running commands by hand isn't efficient once we get above 5 or 10 nodes. If we need to perform actions at scale, that's where Systems Manager can also assist. Back in the console, what I'm going to show here is an automation runbook. If you're not familiar with Automation or Run Command, this is a way to orchestrate a series of steps, whether you're performing AWS API calls or interacting with the operating system of a managed node. Out of the box there are some 400-plus AWS-provided runbooks, so it's another great way to establish a good delegation of permissions to your users.
If a user needs to restart an EC2 instance, instead of giving them direct access to that EC2 call, we should give them access to automation because we know it will be performed in a well-defined and expected manner. In my favorites, I have a simple runbook here. Let's imagine that I need to run a generic script that will be passed against this fleet of resources, whether they're AWS, on-premises, other cloud, or wherever they may be. I'm going to run a script against them.
If we take a look at the authoring experience for automation, I can zoom in here a little bit and walk through what steps are being performed. Using the automation runbook visual designer, you can drag and drop if you need to add more steps to any of your runbooks. Just to walk through this, we're going to start with describing a list of all of my nodes in this account region. Once we have that list, we can then loop through that. We're going to make a quick determination: is that resource a Windows node or is it a Linux node in order to then run the appropriate script.
If you want to also include some additional verification or branching, then we would include that logic as required. In my case, if I take a look at some of these examples and then open up one of these rules real quick, in here I'm just making two evaluation checks. One is whether the resource is a Linux box, and then two is whether the node is actually online. If it's not online, we can go ahead and skip it. We don't need to wait for that process to complete. When that is complete, or when the criteria is met by that branching option, we're then just going to run a simple shell script, just taking a look at whether the syslog or messages are present on the box.
If so, we want to go ahead and store that information in DynamoDB, specifying whether or not it's true or false if that file is present. If we flip back and go over to execute automation, this is going to use that runbook, and I will actually just show the DynamoDB table as well, just to verify that there's nothing in there at the moment. We have this managed node log compliance table here. If we explore the items, it's currently empty.
We go back to Automation. In this case, everything is baked in as far as default parameters, but we can make this very dynamic as needed, even passing remote locations for that script, so you don't necessarily have to keep everything together and can still follow CI/CD best practices. This is going to start iterating through all thirty nodes in the environment at this point, run that simple script, and then put an item into DynamoDB recording whether or not that file is present. We're not going to wait for all of these, but if we run the scan on the table again, we can see that a few of them are already populating with that information.
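This isn't the runbook itself, but the same describe, branch, run, and record logic can be sketched with plain CLI calls, which may help make the flow concrete. The table name, file paths, and attribute names are assumptions based on the demo description.

```bash
#!/usr/bin/env bash
# Rough CLI sketch of the runbook's logic: list managed nodes, skip anything
# that isn't an online Linux box, check for syslog/messages, record the result.
set -euo pipefail
TABLE="managed-node-log-compliance"   # placeholder DynamoDB table name

aws ssm describe-instance-information \
  --query 'InstanceInformationList[].[InstanceId,PlatformType,PingStatus]' \
  --output text |
while read -r NODE_ID PLATFORM PING; do
  # Branch: only online Linux nodes, mirroring the runbook's two checks.
  [[ "$PLATFORM" == "Linux" && "$PING" == "Online" ]] || continue

  CMD_ID=$(aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --instance-ids "$NODE_ID" \
    --parameters 'commands=["test -f /var/log/syslog -o -f /var/log/messages && echo true || echo false"]' \
    --query 'Command.CommandId' --output text)

  sleep 10   # crude wait; a real runbook polls the invocation status instead
  PRESENT=$(aws ssm get-command-invocation \
    --command-id "$CMD_ID" --instance-id "$NODE_ID" \
    --query 'StandardOutputContent' --output text | tr -d '[:space:]')

  aws dynamodb put-item --table-name "$TABLE" \
    --item "{\"NodeId\": {\"S\": \"$NODE_ID\"}, \"LogFilePresent\": {\"BOOL\": $PRESENT}}"
done
```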
Again, using automation and Run Command as a combination, you can perform these actions at scale across your resources. Something else that will also help is called a State Manager association. Think of it as a configuration definition for these resources to perform some of that bootstrapping, and we're going to talk about that in the next section. This will go ahead and keep iterating through. It looks like we're already at thirty-six steps. If we run this one more time, we now have seven nodes that have reported whether or not they're compliant.
We just talked about performing tasks at scale through that automation runbook as well as through Run Command tasks. Now we have a good idea of our foundation of node management. In this case, we have an idea of where our resources are, and we have an idea of performing our patch compliance current state. We talked about how we can securely connect to these nodes and then perform actions at scale. That brings us into the second section. Now that we have this good foundation, we need to be able to monitor those resources, make sure they're in a healthy state, and gather log information. Is my application up and running? Are there currently any issues? Then you could always go back to that same Run Command task and resolve them remotely.
Multi-Cloud Observability: Bootstrapping CloudWatch Agent and Centralizing Metrics
What are some of the challenges that we hear in a multicloud scenario with regards to observability? The first is very similar to that centralized inventory. Our resources are unmonitored. They're being spun up, and they may not have the appropriate bootstrapping configurations. Maybe our application owners are still a little bit new to thinking about observability. They may not be building their application with observability in mind, so we can have Systems Manager help with that process and put in place some initial configurations for monitoring. Similar to that, beyond just our metrics, spans, and traces, we need to also think about our logs both for these remote resources as well as just logs in general that we may be shipping.
They may be scattered across accounts and regions. We work with customers running anywhere from just a handful of AWS accounts in a couple of regions all the way up to hundreds of AWS accounts across multiple regions. So how do we bring this into a centralized location so we can easily query it and get the information needed to help reduce things like mean time to resolution? And then finally, how can we expose these operational metrics outside of the AWS console? Typically we hear from customers that they want to expose these operational metrics to business leaders or other stakeholders who need access to this data, but they may not be familiar with the AWS console, or we just don't want them logging in, considering what other actions they might perform. So how can we expose these operational metrics in an easier way?
So let's take a look; I'll jump over there in a moment. Talking about resources being unmonitored, and going back to the Systems Manager agent from earlier: if you recall, we have those temporary credentials we can use, so we'll work through an example of how we can bootstrap one of these nodes to have the CloudWatch agent installed automatically and report metrics and logs for us. It still starts exactly the same, going through that hybrid activation code process and registering with Systems Manager. When a node first checks in, it queries State Manager for any associations. Again, think of an association as a configuration tool: you define what scripts or actions you want taken on that node, put it in place, and then it re-checks periodically, once a day, once a week, whatever the appropriate frequency is.
Here it's referencing that it will run a series of steps to install the CloudWatch agent using Run Command: it puts the CloudWatch agent in place, puts the configuration in place, and then starts the agent. The agent is then able to leverage those same temporary security credentials that the SSM agent is periodically vending and report all of that information back into CloudWatch. So let's take a look in the console again at that same node we launched earlier, Azure VM4. If we go back to explore nodes and pull up that resource, one of the things I'll note here is the association status. That's exactly what I was just talking about: performing that bootstrapping operation.
If we go down into this associations list, we can see a few of them that have been applied against this node. Now the first one is just going to gather some inventory data, what applications are running, what drivers are running on the box. The second one is performing that patch scan. The third one is keeping the SSM agent up to date. And then the last one is to go ahead and install and configure the CloudWatch agent. So we see that it was successful. If we then go over, let's take a look at that association as well.
Over here in the State Manager section of the console, let's take a look at that install-CloudWatch association. In here there are just some basic definitions; we run this once every seven days. Again, as soon as a node first checks in with Systems Manager, it queries and sees, hey, I need to apply this. It's based off of tags on the node: if the provider tag is Azure, point to this configuration. Now, what steps does this actually perform? If we take a look at the underlying document, it performs just three simple steps. If I zoom in, the first step is simply to install the CloudWatch agent.
The second is a simple shell script I've configured to point the CloudWatch agent at the hybrid credentials that the Systems Manager agent is vending. And then down at the bottom we specify that we want to use a common Parameter Store value when configuring the CloudWatch agent. So if we go over to Parameter Store and zoom back out a little bit, here in Parameter Store we have the exact configuration that our CloudWatch agent is going to use.
When you're building out what metrics are important for your business or your applications that are running, you can store this configuration within Parameter Store. You can even use multi-account sharing for Parameter Store, so from a central location we can make those definitions. We can then have these nodes, based on resource tags, make the determination of which metrics or which logs in the OS we actually want to report back to CloudWatch.
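As a hedged sketch of that pattern: store a minimal CloudWatch agent configuration in Parameter Store, then have the association's Run Command step start the agent in on-premises mode so it picks up the credentials the SSM agent vends. The parameter name and the metric and log selections below are examples, not the configuration from the demo.

```bash
# Sketch: minimal CloudWatch agent config stored centrally in Parameter Store.
cat > cw-agent-config.json <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "cpu": {"measurement": ["cpu_usage_active"]},
      "mem": {"measurement": ["mem_used_percent"]}
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {"file_path": "/var/log/syslog", "log_group_name": "hybrid-syslog"}
        ]
      }
    }
  }
}
EOF

aws ssm put-parameter --name "AmazonCloudWatch-hybrid-config" \
  --type String --value file://cw-agent-config.json --overwrite

# On the hybrid node (e.g. in the association's Run Command step), fetch the
# config from Parameter Store and start the agent in on-premises mode:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m onPremise -c ssm:AmazonCloudWatch-hybrid-config -s
```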
So if we go over to CloudWatch now and scroll down to our metrics, let's find it. Here we go: Azure VM4, and these are all of the metrics now being reported from the box. The CloudWatch agent is using the credentials that the SSM agent vends out, and it's able to leverage those to interact with AWS and push metrics and logs to this location.
One last thing that I also want to show is, beyond just the operating system metrics or logs that we need to collect, there may be other metrics over in Azure that I also want to bring in. So I also want to highlight over in the CloudWatch settings section of the console, you can go over to metric data sources, and this is where I have a direct connection with Azure Monitor. So if there are infrastructure-related metrics outside of the OS that I also want to be able to query from using CloudWatch and have this centralized data set essentially, we can go ahead and create some new data sources.
Now, I'm not going to walk through this process entirely, but when we configure the data source for Azure Monitor, we pass in a few credentials: our tenant, our client ID, and the client secret. Underneath, this spins up a CloudFormation template that puts a Lambda function in place so you can query this information as needed. If we look at the one that's already in place and hit query from CloudWatch, we can select my Azure subscription where I have this configured, select my resource group, and select that same Azure VM4 node.
In this case I'm going to take a look at that same percent CPU average, but like I was mentioning, this is where we can monitor any metric provided by Azure Monitor and bring it into a central location in CloudWatch. All right, so now we have an idea of having our resources monitored wherever they may be, as soon as they're spun up, and getting that information collected into a central location.
Centralizing Logs Across Accounts with CloudWatch Log Rules and Unified Data Storage
Now the second challenge is very similar, but in this case we're talking about our logs being scattered. They're in our dev environment, they're in our production environment, and when users are responding to an incident, they have to make sure, am I logging into the right account right now? That's an unnecessary burden, and what we're going to take a look at, which came out just a few weeks back, is a way you can define rules to bring this information into a centralized account.
As a very simple example, let's say we build out rules stating that our production logs as well as our development logs should be brought into our observability account. You can also specify that the same log information be sent to a backup region. The awesome thing is that bringing it into that first destination region comes at no additional charge, so you can have all of this information readily available and then use the suite of CloudWatch functionality to interact with those logs, whether that's Logs Insights, data protection policies to look for PII, or querying with other log analytics tools like OpenSearch.
So if we flip into the console again, let's go back over to CloudWatch. Earlier this week we also announced the unified data storage experience within CloudWatch, and I wanted to show that as part of the same demonstration. From this observability account, this delegated admin account, we can get a quick glimpse of where all of our logs are being ingested by CloudWatch across different accounts, OUs, and regions, get some initial metrics around them and any anomalies that have been detected, and really just see the entire state of all of our logs. And if we go over to the data sources section and hit enable here, you can see this also launched with a variety of third-party data sources that you can connect to and set up pipelines for. If we filter for the third-party ones, you can bring in data from sources like CrowdStrike, Microsoft 365, Palo Alto, GitHub, Okta, and so on, and have all of that information flow in through these CloudWatch pipelines.
The other major thing you can do is perform transformations on those logs. If they need to be transformed into a well-defined structure like OCSF, you can make that part of your pipeline. Now, the rules I wanted to show are how we establish telemetry rules. Here is an example where I'm specifying that CloudTrail should be enabled, which log group to report it to, and some retention information. This is a great way to establish governance around those accounts, making sure CloudTrail is enabled in each account and region and making sure VPC flow logs are defined and configured.
If we jump over into our organization settings, we can then see those rules to actually centralize the log information. Here in this example rule, I'm just defining for the most part that I just want to gather every single log group within my environment. Now you can of course provide filters, so if you only want to capture specific log groups, down here at the bottom you can see my example, which is just saying if it's a CloudTrail log group, we don't need to centralize it. But you can also use that as an example for your other log groups as needed.
The really important part here is deciding what to centralize. To be extreme, we could do the equivalent of a select star and pull in every log group, but we don't need to. Some log groups are more important for security operations or day-to-day operations and need to be pulled in, whereas application owners may be interested in a different set. Having pulled this information in, if we go back to that log management section and open the same Session Manager log group that lives in my other account, we can see that same data. It's hard to tell, but the log stream name I have highlighted right now is from the member account where that new Ubuntu resource was registered, and I'm now in my delegated 9750 account instead.
I can see all of those same commands that were just issued in that member account, so I can just go ahead and get that information as soon as possible when I am responding to incidents that are occurring. I don't have to worry about jumping between different accounts, making sure I have access to those accounts or those roles, and we can have all of this information pulled in centrally and made available for us.
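From that central account, an on-call engineer could pull recent session activity with a single query; here is a hedged sketch using the CLI, with the log group name, filter pattern, and time window as placeholders.

```bash
# Sketch: search the centralized Session Manager log group from the delegated
# account. CloudWatch Logs expects epoch milliseconds for the time range.
START=$(date -u -d '1 hour ago' +%s000)
END=$(date -u +%s000)

aws logs filter-log-events \
  --log-group-name "SessionManager" \
  --start-time "$START" --end-time "$END" \
  --filter-pattern '"whoami"' \
  --query 'events[].[logStreamName,message]' --output table
```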
Exposing Operational Metrics with Amazon Managed Grafana and Phillips 66 Success Story
We just talked about how we can bring all of those logs together across our accounts and regions. The last thing is: now that we have this data in CloudWatch, what do we do with it? How can we expose this information to our key stakeholders? We'll flip very quickly into an Amazon Managed Grafana workspace now. This is Amazon Managed Grafana, but of course if you're using open source Grafana, that is perfectly suitable as well.
If we log into Grafana, it's asking me to grab my authenticator. Again, one reason you may want to use Grafana is the ability to integrate with your identity providers, in this example Okta. Let me grab my password, very secure. The important part is being able to expose all of this information without requiring the user to connect to the AWS console and navigate our different service pages, making it as simple as possible for those key stakeholders.
Here we have a few example dashboards, but I'm just going to pull up one for the moment. This is based off of our One Observability Workshop, where we have a sample pet site application we can monitor: metrics and logs about how the pet site is doing. Are pets getting adopted? Are they experiencing any issues as they work through the adoption process? We can make all of this information available to those stakeholders, starting with some simple Amazon EKS and Amazon ECS metrics.
If we go over to our connections here, we can even see that we have a few different data sources that are configured right now, including CloudWatch, X-ray, and then Azure Monitor.
So if we go to explore our information now, this is still selecting that Azure Monitor metric source. If we drill down here, we're going to select that same, actually, we're going to select VM2. This is the one that I was running those stress tests on earlier. So if we then refresh, we can see when we use Systems Manager to kill that process, now the node is returning back to its normal day-to-day boring CPU usage of only 0.7%.
All right, let's go back and recap one more time. Setting up centralized observability starts with getting an understanding of our applications, metrics, and logs, and using Systems Manager as a bootstrapping tool to get the agent in place and ready to report. Following that, we can use CloudWatch log centralization rules to bring all of this data into a single account so we can easily access it, and then we can expose those metrics using Amazon Managed Grafana or any other dashboards that are needed. And Clark is now going to talk about how Phillips 66 was able to achieve some success with this.
Thank you, Erik. That was fantastic; I love seeing those demos. As we promised, these are real, actionable techniques and tools you can start using today or when you leave here, right? And Phillips 66 is another example of a customer who's done this successfully and achieved significant benefits. You can see that they've increased their visibility and productivity by using Amazon Managed Grafana and Amazon Managed Service for Prometheus, with up to a 30% reduction in their mean time to recover. When they do have an impairment, they're able to recover much more quickly, which has a significant impact on their business and brings them tremendous value. And they now have a single pane of glass for observability across all of their services.
Of course, having that single pane of glass makes everything easier. You don't have to keep switching between different views or get confused about what you're actually looking at, and it lets you resolve things quicker and understand what's happening much faster. With fewer tools, you have fewer headaches, fewer things to worry about, and fewer things you need to train all of your DevOps engineers and SREs on. And this also gives you cost optimization: fewer tools, lower costs.
And of course, if you can observe things centrally and manage them centrally, including your security, this is going to improve your security posture. If you can clearly see what is compliant and what is not, where patches and security policies are applied, and ensure they're done correctly from one central place, that gives you a greatly improved security posture. And you can use both cloud-native and open source managed services: we showed you Amazon Managed Grafana, but as Erik mentioned, you can use the open source version as well, so you have the flexibility to use whatever tool set meets your organizational needs.
AWS Multi-Cloud Timeline, Resources, and Next Steps
Erik showed you a couple of tools, but we of course have other ways we can help you with this at AWS. And believe it or not, this is not the first year we've done multi-cloud at AWS. I know you've probably heard a lot about multi-cloud this year, but we've been doing multi-cloud at AWS in one way or another since 2016, as you can see in this timeline. We're continuing to add new capabilities, features, and services to help all of you manage your systems as you grow into multi-cloud customers. Here we're just highlighting a few of those, but the key is that we're continuing to grow this list, so expect more and more developments on the multi-cloud front as we continue to build services to support you and your business.
So what next? Hopefully you really enjoyed this talk. We have some QR codes here that you can go ahead and scan to get more information, including information on our hybrid multi-cloud solutions, as well as our Amazon.com blogs site. So go ahead and take a moment there. You can see all the newest posts, you can filter them by category or tags and search for them. I see a lot of phones up, so I'm going to give you a minute to look at those.
We also have some booths in the studio. A couple of specific blogs we want to call out: one on searching through Session Manager logs, as Erik showed you today, and another on observability with Azure using CloudWatch. We've got a Cloud Ops kiosk as well; please stop by and see us. I know we're only open until 4 o'clock today, so if you get time, we've got socks, demos, stickers, and fun stuff like that. And if you have any questions, there's always going to be somebody there who's more than happy to help you.
So with that, I do want to thank you, and please take a moment to fill out the survey in your app. We take your feedback seriously.
; This article is entirely auto-generated using Amazon Bedrock.