🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Advanced analytics with AWS Cost and Usage Reports (COP401)
In this video, Steph Gooch and Andy Brown demonstrate advanced AWS Cost and Usage Report (CUR) analysis techniques across three key areas: optimization, cost allocation, and data security. They show how to integrate VPC Flow Logs with CUR 2.0 to identify NAT gateway misconfigurations causing unnecessary inter-AZ data transfer costs. For AI workloads, they demonstrate cost allocation for Amazon Bedrock using inference profiles with tags and calculate unit costs by combining Lambda invocations with application spend. Finally, they implement row-level security using AWS Lake Formation to share CUR data with application teams while restricting access to only their accounts. Throughout the session, they emphasize using Amazon Q for troubleshooting, combining multiple data sources in Amazon Athena, and the importance of tagging strategies for accurate cost allocation.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Evolution and Challenges of AWS Cost and Usage Report
The AWS Cost and Usage Report is one of the most powerful datasets we have to understand our cloud footprint. It was actually released 11 years ago, and can you believe what it was like before it existed? We have come so far. We have Athena integration, we have CUR 2.0, we have FOCUS, we have the CUR query library. We have grown tremendously since then. However, some of the challenges customers have are still the same as 11 years ago. That is why we have broken our session down into three topics.
Optimization. Who doesn't want to save money in the cloud? This is a classic reason why we have the Cost and Usage Report—to find where we are spending money and to save it. But it can be quite nuanced where those opportunities are, and they can even be improved by having other datasets. Cost allocation is one of the biggest conversations I have with customers every single day. It is about being able to understand and see your spend. Securing our costs is not something we typically discuss in the world of FinOps and optimization, but we hopefully all in this room enjoy using the Cost and Usage Report and we want to allow others to do that. However, we want to be able to control the access so that they cannot see data they should not. In today's session, we are going to cover these topics and give you new ways to access your Cost and Usage Report and solve these problems.
Warming Up: A CUR Basics Quiz Game
Just to introduce myself, my name is Steph Gooch. I am a Senior SA Advocate at AWS, and joining me is Andy. Hi everyone, I am Andy Brown. I am a Senior Manager specializing in FinOps. Before we get into the coding and querying of today's session, we wanted to warm up with a little game. It is early, I know, but we thought we would get the energy flowing. So if I can get everyone to stand up if you can and would like to participate. The way this game is going to work is there will be a question on the screen, and you are going to have two options. If you think it is the first answer, you will put your left hand up. If you think it is the second answer, you will put your right hand up. It is an honesty game. If you get it wrong, you sit down. There are only four questions. The last one standing will get a very special CUR-related sticker at the end of the session that you can come and claim.
The idea is to recap on some CUR basics and make sure everyone is on the same level. First question: What version of the AWS Cost and Usage Report has the newest features—CUR Legacy or CUR 2.0? CUR 2.0. Did anyone get that wrong? Well done, folks, well done. This is the newest CUR, and this is the one that has all the features. If you have not used it, you need to start deploying it, and this is what we will be using in today's session.
Where is the data for the CUR stored? Is it S3 or RDS? S3 is the right answer. Basically, any data you can get into S3 is available for you to query in Athena, and that is what we are going to be showing you later today. You are all doing really well. I am going to need a lot of stickers. Are tags automatically added to the CUR by default? Yes or no? No. The key phrase there is "by default." You have to have cost allocation tags enabled, so you do not end up with all of your organization's hundreds of tags. We will show some tagging later. You can be very smug if you got that one right.
Alright, last question. Can you add account names into CUR 2.0? Yes or no? Yes. Yes, this is one of the new features available in CUR 2.0. And we are going to be using that as one of our ways of securing the CUR further. Brilliant. Alright, you guys go get a sticker, please sit down, and come to us at the end so you can grab one of our CUR special stickers. That was a little bit of fun to get everyone standing and moving before our session because we know it is nice and early. Thanks again for joining us. Let us get coding.
Setting the Stage for Live Coding and Interactive Queries
This is a live coding session. Like I said, we will be querying. It is live, so we will be doing it and we want to have some help, so we will be asking questions to you guys. That is the interactive part and you can shout them out. If you have questions about the session, we will be taking those outside at the end so we can get through everything today. But warning: I am dyslexic. There will be spelling errors. There might be errors. What we have done is we have a nice little tally over here. In our sections, whoever gets the most errors will have to do some kind of forfeit at re:Invent. We do this every year. Make sure that you get our LinkedIn information at the end to see what we do at the end of our session. We will make sure to share a GitHub repository in our resources at the end that has all of our queries so you can take them home. You do not need to feel that you have to make any specific notes.
We've also pre-recorded a session that will be on a YouTube channel so you can watch this back later and follow along. But with that, I'm going to hand over to Andy for our first query.
Investigating NAT Gateway Cost Anomalies with Cost Explorer and CUR
Great, thank you, Steph. So who's been in this situation before? You're at your desk doing your day job, and you get an email from AWS. Your custom anomaly detection has detected that your NAT gateway spend has increased. So you decide to go into AWS Cost Explorer, you filter by the service, you see "other," and you see yes, not only has your NAT gateway spend started to increase, we're also seeing some NAT gateway bytes and we're also seeing some data transfer regional bytes. So audience, shout out if you know what data transfer regional bytes are. Inter AZ? I think I heard "in AZ," yes. So data transfer regional bytes is inter-availability zone data transfer. So for example, an EC2 instance in US-East-1A talking to an EC2 instance in US-East-1B.
And to be honest, I don't really remember all the different types of data transfer and how they pair as usage types in the Cost and Usage Report. And so when I saw this, I just typed it into Amazon Q and it told me. But Steph, if you saw this and you got this email, what would you do? Oh, I would just push it to the developers and run away, probably, which is what you're meant to do, right? Yes, one approach, but probably not the AWS recommended way. So what I'm going to be showing you today is I'm going to be going into the Cost and Usage Report. We're going to filter and have a look at the costs, but then we'll realize that we need more information. So we're going to be integrating a new data source, and then hopefully we'll find that root cause and optimize this spend.
So with that, let's go into the console. Cool, here we go. So everything we're doing today we're doing in CUR 2.0 and we already have it set up to go into Athena. I have this query to begin with, which is designed to mimic what we're seeing in Cost Explorer, so let me run this. It's going to start coming back with the usage types coming through. You'll see we've got a variety of different NAT gateways and we've got this spend. But from this view it's still not very clear on what is causing that increase we saw in Cost Explorer.
And so the first thing we're going to be doing is we're going to be creating a quasi pivot table. I'm going to take the September and October costs and split them out into their own separate columns. So let me scroll up. I'm going to just navigate this out of the way to make it bigger. So within the sum statement, I'm going to be putting a case. Think about case as being an if statement. So case when billing_period equals and we'll do September 1st, then return the cost, else return nothing. So when this runs, only September costs will start appearing, and then so I can rename the column September. And then I just need to do exactly the same for October, so let me copy. Put in the comma, I'm going to change this to 10 for October and change it to October. Cool.
Now I can change my select statement, so I no longer need billing_period because I have my where statement to narrow it down to only September and October. I have it split out into two separate columns, so I can delete that. I'm also going to delete usage_type. Right now I don't care if it's data transfer or NAT gateway related. I just want to know which resources are causing this spike so I can work out what to do next, so let me delete that. I can change my group by because we're now only grouping by one column, which is the resource ID. And I'm going to change my order by because I want to see the most expensive resources in October, because that's where we're seeing the increase. So ORDER BY 3. Fingers crossed everything is right, so let me run this now.
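For reference, here is a minimal sketch of where that pivoted query ends up. The table name (cur2), the billing_period format, and the usage-type patterns are assumptions based on a typical CUR 2.0 data export, so adjust them to your own schema.

```sql
-- Sketch only: September vs. October spend per resource, pivoted with CASE inside SUM.
-- cur2, the billing_period values, and the usage-type patterns are assumed names/formats.
SELECT
    line_item_resource_id,
    SUM(CASE WHEN billing_period = '2025-09' THEN line_item_unblended_cost END) AS september,
    SUM(CASE WHEN billing_period = '2025-10' THEN line_item_unblended_cost END) AS october
FROM cur2
WHERE billing_period IN ('2025-09', '2025-10')
  AND (line_item_usage_type LIKE '%NatGateway%'
       OR line_item_usage_type LIKE '%DataTransfer-Regional-Bytes%')
GROUP BY 1
ORDER BY 3 DESC   -- most expensive resources in October first
LIMIT 10;
```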
While that's running, if you do see us about to make an error, you can shout out and stop us, because often it's easier for people when you're looking at someone else's query to find an error, so feel free to shout out if you see us about to make an error. And my common one is commas. I always forget about the commas or add too many commas. So as we see here, we've got some NAT gateway spends coming through, and we've got some new spends happening in October. So we've got this one here, it's pretty much the same, so this must have been around in our environment before, but we've got two new resources.
I could have tagging on this NAT gateway, which would probably tell me who owns it or what part of the application it is. But the issue we have with the NAT gateway, and also with this data transfer, is that we don't know why. For those who don't know what a NAT gateway is, it's something you deploy, normally within private VPCs, that allows your resources to have secure access out to the internet. It's just the middleman in the process. It doesn't actually generate data itself. So in this situation, am I reaching out to the application team because maybe they changed the application deployment and something's changed? Or am I reaching out to the networking team because maybe they changed something in the configuration which is causing these two NAT gateways to spin up and the data transfer to go through them?
And this is where we hit the limitation of the Cost and Usage Report, and we have to try and work out what's next. I am 100% not a networking specialist at all.
Integrating VPC Flow Logs to Identify Data Transfer Issues
As always, I use Amazon Q. Amazon Q is our best friend. If I open up the panel here, hopefully it will show you the query we ran earlier. Amazon Q helped me cost optimize NAT gateways, and it tells me exactly some of the reasons why your NAT gateway bill might be increasing. It also tells me where to find extra information. Here, Amazon Q is recommending that we use VPC Flow Logs. VPC Flow Logs is something you enable in each of your VPCs, and what it does is track all the data flows within your resources so you can work out what is going on and who is talking to whom.
When you deploy it, data starts appearing within minutes, so it is not something like CloudTrail where you have to wait for maybe up to 24 hours for the data to start flowing through. You also get two choices for where the data can go. It can either go to CloudWatch, so it is with the rest of your application logs, or you can have the data flow into S3. If you choose S3, VPC Flow Logs also gives you a CloudFormation template. The CloudFormation template will do the full Amazon Athena setup for you. We enabled VPC Flow Logs in our area, we ran that CloudFormation template, and this is what we have set up within Athena today.
If I expand out the database on the left-hand side again, I can see I now have my VPC Flow Logs analysis table. I have the table here. Whenever I get a new data source, I normally expand it out and start exploring it a bit because I want to get a feel for the data before I start using it. Let me preview. I do not need all the columns anymore. As the data comes back, we are starting to get some of the information available in VPC Flow Logs. We have the account ID it is all sitting in, we have the source and destination address. We then have some port information, and then we have some byte information. For example, this row here is representing 44 bytes of information going between the two resources. We have the start and end time, which is all in Unix epoch time.
If you wanted to translate this into something more human readable, there is a function within Athena to do that for you. In our case, daily granularity is enough, and VPC Flow Logs is automatically partitioned by day, so I do not need to worry about converting this. Let me change this into something a bit more focused, just the columns of information I want. I want the source address and the destination address. Then I am going to do a sum of the bytes. I need to add in some filters now, so I need a WHERE statement because I am going to filter where the month is in September and October, so I can narrow it down to the months where we saw that spike in spend. I need to add in a GROUP BY 1 and 2 for my source and destination address.
I am also going to order by who is talking the most, so I am going to add in my ORDER BY, and I am going to order by 3 descending because I want to have those bigger speakers. I am keeping the limit at 10 because there are a lot of resources in my account, and I actually want to know who the biggest ones are to work out what is going on. I could just run this right now, and this will return some information, but the byte numbers are going to be really large because we have got a lot of transfer going in. What I really want to do is convert the bytes to gigabytes because that is how AWS charges for data transfer, so we can align it to what is happening in the cost report.
So audience, shout out if you know this. If I take bytes and divide it by 1,024, what do I get? Kilobytes. And another 1,024 gives me megabytes, and then the final 1,024 gives me gigabytes. I am doing that so I can then compare it to my cost and usage report. Let me run this and hopefully we will get some information back. We get information back, but all the numbers are returning 0 bytes. This is due to the way Athena handles data. If we look at the definition of the table here on the left-hand side, you will see that bytes are stored as a big integer, which is a number without any decimal places. When I am dividing by 1,024, Athena continues to ignore decimal places. You remember that row where we had that 44 bytes of transfer? You divide that by 1,024, and you are going to get zero if you round it down.
There is a way to force Athena to think about those bytes and include them. The easiest way to do this is to put a .0 at the end of each of these numbers, so we are still dividing by 1,024, but we are telling Athena we care about the decimal places. Now when I run this, hopefully I am going to get some real data coming back. Here we go. We have got the top four rows coming through. They are the biggest ones, and everything else is under about 1 gigabyte.
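Here is a rough sketch of that VPC Flow Logs aggregation. The table name (vpc_flow_logs_analysis) and the partition column used in the WHERE clause are assumptions; they depend on how the CloudFormation template set up your table, so check your own schema before running it.

```sql
-- Sketch only: top talkers by gigabytes transferred, from a VPC Flow Logs table in Athena.
-- Table name and the month partition column are assumptions about your setup.
SELECT
    srcaddr,
    dstaddr,
    SUM(bytes) / 1024.0 / 1024.0 / 1024.0 AS gb_transferred  -- the .0 forces decimal division
FROM vpc_flow_logs_analysis
WHERE month IN ('2025-09', '2025-10')   -- the months where the spike appeared
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10;
```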
Analyzing IP Addresses and Discovering the Root Cause
So they're small in the whole scheme of things. But if anyone knows these IP addresses off the top of their head, well done. I don't know these IP addresses, and they don't really mean anything to me.
So once again we're back to our favorite friend, Amazon Q. This tab here represents the account where the issue is happening. This is where we deployed our VPC, and I've asked Amazon Q to explain one of the IP addresses. It's able to pick up exactly which instance ID it is, the servers we stopped, what instance type it is, and all the relevant information.
What we actually did was put all these IP addresses into Amazon Q and asked it to tell us what they are and also group them together. We have a quite large environment with auto scaling up and down. Here's what Amazon Q returned. It's using a LIKE statement because we've got auto-scaling groups, so the IP addresses of the instances are changing over time. It's categorized everything into all the different subnets we have.
We then have the two NAT gateways we've got running in the environment, both their public and their private IP addresses. And then we have this general group by. When I asked Amazon Q about some of the IP addresses, it didn't recognize them and couldn't find them within our environment. So it made a guess that they're either internet addresses, maybe files we're downloading or getting from, or maybe AWS services like S3 or RDS that we're reaching out to. In our situation, because we're just tracking what we can control, we're going to group it under here.
This will repeat again for the destination location. Let me take this and copy it. I'm going to go back to my query and replace source and destination address with those two queries. Now when I run it, we'll get a bit more useful information, and it's started to categorize.
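As a rough illustration of what Amazon Q handed back, the categorisation is just a CASE expression with LIKE patterns, repeated for source and destination. The CIDR prefixes and NAT gateway IPs below are placeholders, not the session's real addresses; substitute the ones from your own environment.

```sql
-- Sketch only: map raw flow-log IPs to friendly labels, then aggregate.
-- All IP prefixes and addresses here are placeholders for illustration.
SELECT
    CASE WHEN srcaddr LIKE '10.0.1.%' THEN 'private-subnet-az1'
         WHEN srcaddr LIKE '10.0.2.%' THEN 'public-subnet-az2'
         WHEN srcaddr IN ('10.0.1.200', '203.0.113.10') THEN 'nat-gateway-az1'
         ELSE 'internet-or-aws-service' END AS source_location,
    CASE WHEN dstaddr LIKE '10.0.1.%' THEN 'private-subnet-az1'
         WHEN dstaddr LIKE '10.0.2.%' THEN 'public-subnet-az2'
         WHEN dstaddr IN ('10.0.1.200', '203.0.113.10') THEN 'nat-gateway-az1'
         ELSE 'internet-or-aws-service' END AS destination_location,
    SUM(bytes) / 1024.0 / 1024.0 / 1024.0 AS gb_transferred
FROM vpc_flow_logs_analysis
WHERE month IN ('2025-09', '2025-10')
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10;
```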
I have a challenge for you. Can anyone recognize which row number is the issue? Row three is the issue. There are two reasons why I know this. One, I asked Amazon Q. But two, I went back to the fundamentals of what the costs are. I'll tell you why this is an issue. This is not a networking session; this is advanced analysis for cost optimization. There are two reasons. First, I mentioned that NAT gateways are designed to provide private resources access to the internet. We have a public subnet using this NAT gateway, but a public resource already has a public IP address. It doesn't need to go that way; it can just go to the internet itself. So why is it routing that way?
The second issue is that we have this public subnet in availability zone two talking to a NAT gateway in availability zone one. That's causing inter-AZ data transfer. As I said, that was purposely designed to challenge you. I wasn't expecting people to be able to identify this. So how do we actually identify what the issue is? It's down to the cost, it's all about the money at the end.
Calculating Costs and Leveraging Amazon Q for Optimization Insights
What I'm going to do next is work out what the NAT gateway and data transfer charges are for these entries within this table. I'm going to start by using a WITH statement. A WITH statement is where you create a sub-query with Athena. We're saying run everything here and we're going to use it again later. I'm going to do WITH, call it VPC AS, open the bracket, and then go all the way to the end of the query we were working on before. I want to take the semicolon away and close the bracket. So that whole thing we've been working with is now called VPC.
I can do SELECT STAR FROM VPC, and now I'm going to add in my two columns for the NAT gateway and data transfer charges. I'm going to use my old friend CASE. So CASE WHEN source location is LIKE NAT or destination location is LIKE NAT, so anything which is using the NAT gateway, then I'm going to take the sum of the gigabytes and multiply it by 0.045. The reason for 0.045 is that this is how much AWS charges for NAT gateway data processing in USD per gigabyte. Amazon Q told me this in my earlier query, so that's where I got it from. Everything else will be zero, and I call this nat_data_usd.
Next statement is for the inter-availability zone data transfer. This is going to be a CASE WHEN the source location is LIKE AZ1 and we want an AND statement because we want it where AZ1 is talking to availability zone two.
So the destination location is AZ2. Then this is the sum of gigabytes, and this time we're going to multiply it by 0.02 because that's how much we charge for inter-availability zone data transfer. Then I need the same thing for the other direction, so let me copy this. We've got Availability Zone 2 talking to Availability Zone 1. And then ELSE 0, and this can be my data_transfer_usd. Fingers crossed this is all good, let me run this. It's running, always a good sign. As you can see, it is row three. All the charges—remember if you go back to that original Cost Explorer view, the NAT Gateway data processing started up, the data transfer started up, and we can see it's all row three. What this tells me now is I can reach out to my networking team to investigate. There was maybe a configuration change, maybe they did something to the route tables for that public subnet which is now forcing it to go through that NAT gateway. It's just that extra information to provide to them.
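Putting the pieces together, a sketch of the final query might look like the one below, with the earlier categorisation inlined as the VPC CTE. The labels, IPs, and per-gigabyte rates (0.045 USD for NAT processing and 0.02 USD for inter-AZ transfer, the us-east-1 list prices mentioned in the session) are assumptions; check the rates for your region and swap in your own IP mappings.

```sql
-- Sketch only: estimate NAT processing and inter-AZ transfer charges from flow logs.
WITH vpc AS (
    SELECT
        CASE WHEN srcaddr LIKE '10.0.1.%' THEN 'private-subnet-az1'
             WHEN srcaddr LIKE '10.0.2.%' THEN 'public-subnet-az2'
             WHEN srcaddr IN ('10.0.1.200', '203.0.113.10') THEN 'nat-gateway-az1'
             ELSE 'internet-or-aws-service' END AS source_location,
        CASE WHEN dstaddr LIKE '10.0.1.%' THEN 'private-subnet-az1'
             WHEN dstaddr LIKE '10.0.2.%' THEN 'public-subnet-az2'
             WHEN dstaddr IN ('10.0.1.200', '203.0.113.10') THEN 'nat-gateway-az1'
             ELSE 'internet-or-aws-service' END AS destination_location,
        SUM(bytes) / 1024.0 / 1024.0 / 1024.0 AS gb_transferred
    FROM vpc_flow_logs_analysis
    WHERE month IN ('2025-09', '2025-10')
    GROUP BY 1, 2
)
SELECT
    source_location,
    destination_location,
    gb_transferred,
    -- anything touching the NAT gateway pays the data-processing rate
    CASE WHEN source_location LIKE '%nat%' OR destination_location LIKE '%nat%'
         THEN gb_transferred * 0.045 ELSE 0 END AS nat_data_usd,
    -- traffic crossing availability zones pays the inter-AZ rate (both directions)
    CASE WHEN (source_location LIKE '%az1%' AND destination_location LIKE '%az2%')
           OR (source_location LIKE '%az2%' AND destination_location LIKE '%az1%')
         THEN gb_transferred * 0.02 ELSE 0 END AS data_transfer_usd
FROM vpc
ORDER BY nat_data_usd + data_transfer_usd DESC;
```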
What I want you to take away from this is to think about what other data sources you might use. I do love CUR. I wouldn't be on this stage if I didn't love the CUR, but it's not the answer for everything. Maybe it is data transfer you're investigating, and that's where VPC Flow Logs come through. Or maybe you're investigating extended support charges and why they started, and you want to bring in an inventory report. As long as you can get it into S3, you can get it to Athena. Finally, if you don't know what to do, use Amazon Q. The example I'm showing you today is a genuine customer problem I helped with, and I really had no clue what to do. I use Q for everything. I put all the information and all the results into Q, and it told me exactly what to do. It told me where to send people and who to investigate, because Q knows your AWS account. Everything I showed you today is from the free tier of Q as well, so it didn't cost me anything to run it.
Cost Allocation for Amazon Bedrock: Inference Profiles and Department Tagging
This part of the session was very much focused on a particular service—we were looking at the NAT gateway. But what happens if you want to work out whether your whole application is running efficiently and whether you have the right KPIs to track that? With that, I'm going to hand over to Steph to talk a little about our favorite friend, AI. Thanks Andy. Cost allocation is one of the big challenges we see with customers, and it has been made trickier by things like Amazon Bedrock when it comes to allocating that spend. We're going to focus on diving into some cost allocation for Bedrock usage and then look at the bigger picture.
There are two big reasons why we see customers not being able to cost allocate Bedrock accurately. The first is billing entities. When you choose your models, these have an effect on the entities, and these show up differently in the Cost and Usage Report. We have first-party models, which are ones that are trained, sold, and developed by AWS. Think of things like Nova and Titan. Next, we have second-party models, which are ones that are developed and trained by someone else but are sold on AWS. Think of things like Llama. Finally, we have third-party models, which are ones developed, trained, and sold by someone else, but to get them onto AWS you need to use Marketplace. This is why they show up slightly differently, and it can't be as simple as selecting WHERE the product is Bedrock.
The second challenge we have is that when you use a model, the resource ID just shows up as the model name. That's all the information you get by default. Unlike EC2 or like we saw with the NAT Gateways, when you get a specific ID that you can go and find in your account, you can't find it—it's just a model. But there is a way around that, and that is inference profiles. Inference profiles are a way to preconfigure how you interact with Bedrock. They allow you to choose some customization, so what model do you want to use, what temperature do you want to have, and you can tag it. Of course, tags are going to come up. You can tag this resource and then see who is using that model and where they are using it. You can do this in things like CloudFormation, and this is an example I have that's also in the GitHub repo we'll share at the end. You can reuse this in your applications, or you can use it in an SDK like Python, and you can have that. What we'll be doing is cost allocating some usage based on the department tag you can see there to see who is spending money from what department in our accounts.
Let's get into it. I already have a bit of a pre-done query with some simple stuff in it—things like product name, usage, spend, and resource, classic information you use. But I want to draw your attention down here to the WHERE statement. This is where we cover those first, second, and third-party model information.
When we have a product code like Bedrock or a product name like Bedrock, that's how we cover everything. For example, when you use Claude, it comes up with "Claude, Amazon Bedrock," so we just want to check that one line. The same applies to tokens. When it comes to first and second parties, you have line item usage type like "token" in lowercase, but you also have line item usage type like "Token" in uppercase. Notice there's no "s" because sometimes it's "token" and sometimes it's "tokens." I'm not sure why I didn't make it consistent, but we want to cover everything, so we've got these where statements in there. I've already run this query and we can start to see some of the usage.
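A rough sketch of that base query is below. The column names follow a typical CUR 2.0 export, and the product-name lookup through the product map is an assumption about the export schema; the OR conditions are meant to cover first-, second-, and third-party (Marketplace) models and both token spellings.

```sql
-- Sketch only: Bedrock usage and spend across all billing entities.
-- cur2 and the product map access are assumptions; adapt to your export.
SELECT
    line_item_product_code,
    line_item_resource_id,
    SUM(line_item_usage_amount)   AS usage_amount,
    SUM(line_item_unblended_cost) AS unblended_cost
FROM cur2
WHERE (line_item_product_code LIKE '%Bedrock%'                       -- first/second party
       OR element_at(product, 'product_name') LIKE '%Bedrock%')      -- Marketplace models carry Bedrock in the product name
  AND (line_item_usage_type LIKE '%token%'
       OR line_item_usage_type LIKE '%Token%')                       -- both spellings appear
GROUP BY 1, 2
ORDER BY 4 DESC;
```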
So we see our spend, but this is where we start to see information like the resource ID. I want to show you an example. These two rows are what it looks like when you have no inference profile. You just have usage of a model, right? If you wanted to cost allocate this, it's not very useful. So we can see some people are starting to use inference profiles and see some tags. We see things like the dev teams using Amazon Nova Micro, we have security teams using Claude 3 Pro, and we even have some sales teams using Claude.
We can see this difference from that first line, and we can see who the tags are. What we're going to do first is look at some people who haven't tagged their spend. These are the naughty people. We want to make sure we can go and find those people and say, "Hey guys, you've already got the inference profile, you've already done the work. Let's just tag it. Let's add those couple of lines." So we're going to do a query just to find those people to have a bit of a hit list. We're also going to use a WITH statement.
WITH tokens AS, bracket everything up, and SELECT FROM tokens. So what information do we need to get? Well, we're probably going to want to grab those resource IDs so we can see what account it is and what the inference profile hash is. Then we're going to get some usage because we want to know who the heavy hitters are for this. We don't just want people who are doing hardly any work, like POCs. We want some bigger, maybe production workloads. We'll give that a name, and then we'll also get our spend because, as Andy said, it all comes back to the money.
Because we have spend, we have to have some group by. So a group by one in this case, and we'll do a where filter. The where filter is going to be where the resource ID is like, if we look down here, we can see "application-inference-profile." That's the kind of calling card that we're using. So we can drop that in there. I don't want ones that are already tagged, so we can just filter those out. So where my user department tag that I've extracted is null.
Guys, you can put a tally up, that's the first one. What have I done wrong? It's just the LIKE, isn't it? Well, there had to be some drama in the session. I could say I planned that. Okay, so here we have my inference profiles, a nice little list, so we can go and say, "Hey, you teams, you're in this account. We can see some usage. Can you just go and tag this resource? Because we need to be able to cost allocate our applications properly."
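For reference, the hit-list query might look roughly like this. The resource_tags map access and the 'user_department' key mirror how the session's tag appears to be named; they are assumptions, so swap in your own cost allocation tag key.

```sql
-- Sketch only: find inference profiles with usage but no department tag.
WITH tokens AS (
    SELECT
        line_item_resource_id,
        element_at(resource_tags, 'user_department') AS user_department,  -- NULL when the tag is missing
        SUM(line_item_usage_amount)                  AS usage_amount,
        SUM(line_item_unblended_cost)                AS unblended_cost
    FROM cur2
    WHERE line_item_product_code LIKE '%Bedrock%'
      AND (line_item_usage_type LIKE '%token%' OR line_item_usage_type LIKE '%Token%')
    GROUP BY 1, 2
)
SELECT line_item_resource_id, usage_amount, unblended_cost
FROM tokens
WHERE line_item_resource_id LIKE '%application-inference-profile%'
  AND user_department IS NULL      -- flip to IS NOT NULL (and select user_department) to allocate the tagged spend
ORDER BY unblended_cost DESC;
```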
But we want to cost allocate the ones that have tagged it. So the small change we need to make is to flip "is null" to "is not null." We can get rid of that application-inference-profile filter because it's now moot, and rather than resource ID, we can group by user department. What we now see is our user departments, some usage, and some unblended spend. It's as simple as that. We've got some cost allocation.
But one thing I want to draw your attention to is the usage amount. What we can see is the sales teams are the big spenders that we have in our organization. They have $2.31 in spend. But if we look at things like the dev team, they have a third of the spend but almost the same usage. Does anyone know why that is? Someone said free tier. It's because of the model selection. So earlier you saw the dev teams and the security teams using Amazon Nova, which is a fraction of the cost of things like Claude. Maybe the sales team needs to use Claude for whatever they're doing, but if you're also looking to optimize your Bedrock spend, having some decisions about what models you're using will impact your cost. So if you start to see a team, a project, a department spend really spike, you could have a conversation with that team and say, "Hey, your model selection is really impacting this. Is that the right model?"
If it is, that's fine. If not, maybe consider developing a proof of concept and trying some cheaper models like the Amazon native ones.
Understanding Unit Cost: Measuring Application Efficiency and Optimization Impact
What happens when Bedrock is part of a wider project? Not just specific to its own tokens in and out, but what if we want to think bigger? What if it's part of a full story? What if we want to track optimization work and see if we've moved to a cheaper model or if we've deleted things like idle NAT gateways that we don't need or that are incorrectly set up? This is where something like unit cost comes in.
The unit cost is the expenditure to sell one unit of something or one click. The story in this image is that someone's cost is going up. The spend is growing and growing, and hopefully the project is doing well. The question is, is the spend going up because more people are using our application, more people are logging in, and more people are demanding information from our applications? Or is it because someone left something running, like that gateway, or someone chose an expensive model? How do we work that out? That is where unit cost comes in. We take the cost and divide it by things like invocations or usage of an application.
What I've done is created a simple little web application that we're going to show, which uses resources like EC2, Aurora, Lambda, and Bedrock. The question is, how much does that one click cost? That is the question we want to answer.
So what we need to do is grab all of the spend of our usage and then divide it by some kind of unit, some kind of invocation. We can do that because the driving force for this Bedrock usage is Lambda. Every time I click that button, a Lambda function gets triggered. We can actually grab that data from the Cost and Usage Report, so that can be our usage amount, and then we divide it by the cost. This is something I didn't know before we got into the session and started playing around with ideas. You can actually extract this from the Cost and Usage Report. So we're going to do that now.
Does anyone know how I grab the usage of Lambda in the Cost and Usage Report? Which column is it? It's usage amount. Unsurprisingly, it's actually called usage amount. We can actually steal this from my previous query. So usage amount, if we drop that in here and we sum it, and I filter based on Lambda, we filter based on the application and the account ID and the month. If we drop that in there, we'll start to see how many times that Lambda function was hit, how many times in a couple of months people have hit that button.
Great, now we have a base. Now we need to go and get some spend. So we can group that up and call it invocations. We're just going to grab our spend for this specific application, so we'll grab it from the CUR. All we're going to need is some of our spend; we can grab that from our other query. It's always safer to copy and paste.
As cost, we're also going to see what products we have in here. We want to make sure that we're capturing all the right information. And finally, we need billing period because we want to check when we're doing this. We're just looking over a certain couple of months. Since we have a group by, we've got to do that sum. We have a group by one and three. And we need a where statement. Luckily, I've done a great job of tagging in this application, and so we've got a little tag we can use. So our tag.
That's because we're on the wrong dataset. If you ever come across the situation where suddenly you can't auto-complete column names, it's because you're not on the right table. Mine is called, wrong brackets, it's a user tag, and it's my FinOps tag equals AI. Very simple, just selecting our spend for this. You get an extra special sticker because you saved me. That's what friends do. We're not friends apparently. Alright, goodbye. And we also need another where statement, so we're just grabbing these specific months. Am I safe? No.
We're grouping by 1 and 3. Right, you can put a tally on. I think I know who's going to be doing the challenge. Catch. Yeah, at least this way you guys can see how we fix errors; it makes it a more interesting show. So now we can see the resources we have and the cost as well. What is interesting is that some of these are going to be quite static. For September, my resource wasn't on the whole time, but if it was on 24/7 at the same rate, it's going to be a very similar cost. Things like Bedrock really vary because of how many tokens come in and out. When I run that little web app, sometimes it gives me a massive amount of information back, sometimes a small amount. Those things shift, so those costs are quite variable.
So now we have all of this information, we can do another WITH for spend. Grab all of this. I don't need my product breakout anymore, so I can get rid of that. Change that to 2. And now we're going to combine this data. So we're going to SELECT FROM spend. We're going to do it in a join to connect these two things together, a join to invocations. We're going to make a little shorthand to make combining these a bit easier, and we're going to select the information, so SELECT s.cost, and we're going to do i.total_invocations. And then the very simple arithmetic of the first one divided by the second one. We're also going to grab that billing period, because this is how we're going to join these, so we're going to JOIN ON i.billing_period equals s.billing_period. Any issues? Am I safe? Oh no. Oh, the i, every time. Guys, I thought we were friends. Where's the support? Now we can see these things. Oh, I've got to do a forfeit. Now we can see our cost per click. So we have those 89 clicks, and we have similar numbers across the months because there was about the same amount of spend and similar usage.
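To make the shape of the final query concrete, here is a minimal sketch: Lambda invocations and the application's tagged spend as two CTEs, joined on billing period. The 'user_finops' tag key and the 'ai' value mirror the session's tagging, and the Lambda usage-type filter is an assumption; adjust both to your own setup.

```sql
-- Sketch only: cost per click = tagged application spend / Lambda invocations, per billing period.
WITH invocations AS (
    SELECT
        billing_period,
        SUM(line_item_usage_amount) AS total_invocations
    FROM cur2
    WHERE line_item_product_code = 'AWSLambda'
      AND line_item_usage_type LIKE '%Request%'               -- count requests, not GB-seconds
      AND element_at(resource_tags, 'user_finops') = 'ai'     -- assumed tag key/value
      AND billing_period IN ('2025-09', '2025-10')
    GROUP BY 1
),
spend AS (
    SELECT
        billing_period,
        SUM(line_item_unblended_cost) AS total_cost
    FROM cur2
    WHERE element_at(resource_tags, 'user_finops') = 'ai'
      AND billing_period IN ('2025-09', '2025-10')
    GROUP BY 1
)
SELECT
    s.billing_period,
    i.total_invocations,
    s.total_cost,
    s.total_cost / i.total_invocations AS cost_per_click
FROM spend s
JOIN invocations i
  ON i.billing_period = s.billing_period
ORDER BY s.billing_period;
```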
So what if we optimize? What if we got rid of idle resources we don't need, or what if we changed model? What if we moved to Graviton? What if we right-sized? What's going to happen to that per-click cost? What we can do is add in another month's worth of spend, where I actually went and did my job: I went and optimized. And now we'll see, hopefully, that the spend per click has dropped. So it's now, what is that, 25% of the cost, just by optimizing. What I hopefully want to inspire you with here is yes, to show you how to get this data, but also, if you are doing optimization work in your organizations, try to find a KPI like this you can track.
You can also use things like compute gigabytes, or any kind of usage you can gather. People often try to focus on grabbing data that is part of their application usage, maybe from inside the application itself. That can be quite tricky to get, especially if you work centrally or you don't own that specific application. Whereas if you do it like this and you have it in the CUR, you've already got access to all of that information. So with that, tag your usage. You don't always have to tag everything in your organization, but there are some situations like this where the nuance needs to be added so you can get more detailed information for your applications.
Securing CUR Access with AWS Lake Formation: Row-Level Security Implementation
Also try getting different metadata or data for your unit cost. Even things like API usage, Lambda like we said, or there are so many other places, like CloudWatch, where you can grab data that you can drop in and use as a dataset to combine. With that, I'll hand it over to Andy for our last query. Thanks, Steph. So who loves data? I am 100% one of them. There is a time and a place for a dashboard, but I am happiest when I can dive into those numbers, dig, and find those optimization savings. And I think there are plenty of people like us, like our application teams.
On AWS there are a few different ways you can share cost data. Within the console, for Cost Explorer and budgets, you can use billing views, which are a way to group accounts together and then share that group of information so people can go to one place. In the world of dashboards, Amazon QuickSight has native support for row-level security, and our Cloud Intelligence Dashboards have it as part of the solution. However, what about the CUR? What happens if you want to give the CUR to your application team, but you don't want to show everything to them? And that's what we're going to be covering today. We're going to be starting off with AWS's recommended way of deploying the CUR, and this is what we've been using today.
We're going to be using Athena as normal, and we're not going to be introducing anything else. The only thing we're going to be changing for security is we're going to introduce a new service into the mix, and this service is AWS Lake Formation. Lake Formation is a way for you to create a data lake with fine-grained security controls, so you can say the app team can only see this.
There are going to be four steps we're going to be following for this. First, in Lake Formation, we need to tell it that we need help managing security, and this is where the data for this is stored. In the CUR world, we'll give it the S3 bucket. We then need to create our groups, so a group of accounts or groups of certain resources, and that's what they call a data filter. We're going to then review our Athena configuration to make sure that our app team can go ahead and actually run the queries.
And then finally, everything in AWS is all about permissions and security, so we'll just be going through that and making sure they can do everything. So with that, let's switch over to the console. So what we have here, let me go up to my query. I have a very simple query just to show you what we have in our CUR table today. It pulls the line item usage account name, and this should return the eight accounts we currently have in our AWS organization.
What I want to do today is give my app team access to their dev and prod accounts, but I don't want them to see anything else. Throughout this we're using the new multi-session feature, so this tab over here is my app team. I'm still in my FinOps account, but I'm now in my app team role, whereas everything before this point in the session we've been doing as admin.
Here you'll see they don't have any access to Athena. They're getting errors, they can't see any tables, and there's nothing you can run. Let's go back to our admin. And we'll come back to them later. So like I mentioned, step one is telling Lake Formation I need help managing access to S3 buckets, so let me go to Lake Formation.
And this is what Lake Formation looks like when you first start. By default, Lake Formation will take in all the information already stored within the AWS Glue data catalog. So everything you already created in Athena, for example, your CUR tables, any views, all that information will still be already available for you in Lake Formation. Down on the left-hand side, I need to go to my data lake locations, and I'm going to register a new location.
I need to give it the S3 bucket name, so let me browse. I know it's in CID. It's my data exports. We use one bucket for everything, so I'm going to select that. We then need to provide an IAM role. This IAM role is going to be working on behalf of the app team to actually go and get the data, so it needs access to everything. AWS does have a service role you can use for this, and that's what we're going to be using today. However, you should maybe speak to your security teams first to see if you want something a little bit more restrictive.
And then the last choice you have here is the permission mode. You have a choice between hybrid access mode or Lake Formation mode. Lake Formation mode basically says that all access to this S3 bucket is now only managed in Lake Formation, so any permissions already granted elsewhere, through IAM for example, will be ignored. In our world, we're going to be using hybrid access mode. Hybrid access mode takes the permissions we're defining in Lake Formation but also uses the permissions in IAM, so they work together in a hybrid way. We're good at naming sometimes.
We're doing this because we already have the Cloud Intelligence Dashboards set up. We have our finance teams using the CUR today, and we don't want to break anything. But maybe in the future we want to move to Lake Formation mode so we only manage permissions in one place. So step one is done. We've told Lake Formation to help us manage access to this S3 bucket. Now if I go to the data permissions tab, we'll see what's already been set up.
Like I said, it's already taken everything we've got in Athena, so if I filter by table equals CUR2, you'll see that our admin role, the one we've been using all day, has access to the table and to all the columns in it; that row basically represents the existing IAM permissions. So I now need to create a new permission for my app team and define what that means. I do this by going to data filters up here on the left.
I'm going to create a new data filter. I give it a name. I'm going to choose what access they're going to have. They're going to have access to my data exports database and then my CUR2 table, and you can restrict both the column and the row level. In my case, for my app team, I want them to only access the unblended costs. They don't need to see any of the other cost columns that might confuse them or lead them to make some wrong decisions. So I'm going to exclude columns.
I'm going to filter for the cost. I'm going to select them all. They can have access to cost category, and like I said, they can have access to unblended cost.
But everything else will be excluded. To restrict access so they can only see their account information, we're going to set up row-level access. I'm going to filter rows, and what you need to do here is complete this statement. Select star from CUR2 where. In our situation, we can do line_item_usage_account_name is like app, because we have that consistent naming convention. You can put anything in it. If you want to filter by particular products or particular resource IDs, you can put that. It's just a normal WHERE statement you would write in Athena.
One thing you can't easily do is filter by tag, and this is due to the way that CUR 2 works with tags. With CUR 2, tags are now stored in a map function called resource_tags. Athena and Glue can't see the values of the tag, so this will fail when you try to validate it. There is a workaround. What you can do in Athena is create a view where you take the relevant tags into its own separate column and do everything we're showing you today based on that view, not the table. That's the workaround, but for us, we're going to do it by account name, and so this should be sufficient for us.
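A minimal sketch of that view-based workaround is below, assuming the CUR 2.0 table is called cur2 and the tag key is something like 'user_team'; both names are placeholders for your own. You then point the Lake Formation data filter at the view instead of the base table.

```sql
-- Sketch only: surface a tag from the resource_tags map as a plain column,
-- so Lake Formation row filters can reference it.
CREATE OR REPLACE VIEW cur2_with_team AS
SELECT
    *,
    element_at(resource_tags, 'user_team') AS team_tag   -- assumed tag key
FROM cur2;

-- The data filter's row expression then becomes an ordinary predicate, for example:
--   team_tag = 'app-team'
-- or, as in the session, filter on account name against the base table:
--   line_item_usage_account_name LIKE '%app%'
```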
So we create the data filter. What I've done is created the filter, but I haven't said who should have access to it. Let me tick the box. I'm going to grant permissions and choose the IAM role for my app team. You'll see they've already picked up all the things because I did it from the data permissions tab. You choose what access you want, so I'm going to allow them to select and describe. I'm not allowing them to drop, and I'm not going to allow them to grant any more permissions, just select and be able to run it.
Then you've got this last checkbox. In our case, we're going to tick it. Basically, this checkbox says do you want to enable this right now? You might not. Maybe there are lots of other configurations, maybe there are other data sources or other teams you want to set up, and you're not ready for them to go live. This is the only thing we're doing, so we're ready to go live, so we're going to tick it. If you don't tick this box, you can just go to hybrid access mode here on the left and set it up there. But in our case, we're ready to go. So that's the Lake Formation confirmation done for steps one and two.
Step three is all about Athena. For those who have set up Athena before, you know you need to provide Athena with an S3 bucket. If you give your app team access to this S3 bucket for the Athena results, they might actually see other people's results and get access to data they shouldn't. If they're anyone like me, they're sneaky and very good at digging around and trying to get access to data in slightly different ways. So this is not great, and we want to try to avoid this situation. What we're going to do in Athena is create a new workgroup just for this app team. A workgroup is just a new configuration setting.
Let me go back to Athena. I'm going to open up the menu here and go to workgroups. I'm going to create my workgroup and give it a name. I'm going to leave everything the same here, and then when we go to query result configuration, in the past you'd have customer managed and you would provide that S3 bucket you might have created. In our case, we're going to use a new feature called Athena managed storage. Here, Athena is going to manage the S3 bucket for you. No one in your account is going to have access to this.
This is also available free of charge, so when things are free in AWS, take advantage of it. It's free of charge and it keeps the data for 24 hours, so it's perfect for this situation because it means no one can go and see other people's results. So I'm going to create my workgroup. We're now in the final stage: setting up the IAM permissions. I'm going to go to Identity and Access Management (IAM). I need to go to my app team role.
You'll see I already have one policy for them. This app team has already been given access to the billing view, so they've been used to seeing a group of accounts within Cost Explorer. This is where they got the appetite to see more and they wanted access to the CUR. So I need to grant new additional access to them. I've actually already written a policy for this, mainly because policies can be quite difficult because you need to work out what's going on and make sure you get the right thing. So I created one for you. As a reminder, everything we're showing you today is available on our GitHub, including this policy, so you'll be able to take it from there and reuse it.
Let me take you through what's in this policy. First, this section is the most important one. This is the thing which gives the team access to the information in Lake Formation. So this would allow them to go and pick up that permission we set for them.
Everything else within the policy is our standard setup. So we have all the normal Athena things—they can run queries and get results. But you can see I defined the workgroup to be the one we created, so they can only run things in that workgroup and access that configuration. We then give them Glue data access so they can see the table, but once again I only give them access to that CUR2 table so they can't see anything else. So let me go ahead and tick the box. Let me add the permissions. And we are ready to go, and this is where I'm going to get some marks. Let's see what I messed up.
So I'm going to go back to Athena. The first thing I'm going to check is that I haven't messed up my own access and haven't taken anything away. So let me rerun this query here. We run it and you can see we're still getting eight accounts, so good, my access is still stable and things are working. Now, let me go to this tab here. Oh no. Oh no, Andy gets an error. Do you mind? We just need Steph to log me back in. Let me swap over. OK, you can keep talking. Yeah, so we're going to be going back here and, fingers crossed, once Steph's logged us back in, we will be able to see that they now have access to the CUR table.
All right, let's have a look. This is what happens when you prepare too early for a talk. Here you go. Interesting. You need to go to the reinventing, yeah, maybe. Well let's see. But once Steph's got this back up and running, we'll show you exactly, and fingers crossed it is working, and what will happen is within the FinOps account, this app team will only see their two accounts, and so they can now run queries. And so what we really want you to take away from this is everything we've shown you with Lake Formation is not actually CUR-specific. It could be anything you've done in Athena. We, yes, ta da.
OK, so here we go. So we're in the FinOps account. I'm in my app team role. Yeah, OK, so you can see here on my left hand side my app team can now see the CUR2 table. So if I expand it out and I filter for those cost columns, yes, they can only see those cost columns I give them access to, they can't see anything else. And now if I go back, let me see what we've got here. Which one, here we go, OK. I'm going to take the query we had before. I'm going to paste it in here, and now this is the one with the massive fingers crossed. Let's see if it works. It should now return back only the two accounts.
There we go, so they have access. Like I say, they can now run queries and do things themselves but only see their data. Everything I've shown you today is not CUR-specific. So think about what other teams are reaching out to you and asking for data. Maybe you could repeat the same process. As long as you can get the data into S3, you can get it into Athena, then you can register it in Lake Formation and manage it the same way. So think about it: maybe you're downloading Cost Optimization Hub data exports, maybe some inventory or other reports you're producing, and you can use this solution for that.
Conclusion: Key Takeaways and Call to Action
And so today we covered three main topics. We started off with looking at how to optimize data transfer by introducing VPC Flow Logs and using that as an additional data source. We then talked about AI and worked out how do you know if your AI workload is efficient. And then finally we talked about how can we share this data with more people, but share it in a secure way. I'm going to hand it over to Steph to finish us off. We've made it to our call to actions, a couple of things for you guys to take away. Discover other data sources. We showed you VPC Flow Logs, but there are CloudWatch, S3, there's loads of other services that you can plug into Athena and start to connect this data.
Tag your usage, like we said, where relevant. I always advise an account-based organizational structure, but if there are specific use cases for tagging, do enforce those. Enable row-level security and give other people the power of the CUR. Give them the chance to go in and dive into this data as well. Now I know that is a lot of stuff we have covered today, but I want you to have a go. I want you to pick one thing from today's session and try it in your accounts. That is it. Go and get back into it after re:Invent and see if it works for you.
Here's the QR code you've all been waiting for, with the GitHub repo in it. It's a Linktree, and there are a bunch of different resources in here, including a home recording of this session. This session is recorded and will come out later, but we have a home recording in there so you can just pop in and see the queries. There's also our LinkedIn, so you can see whatever I have to do. We've also got another session. And then we will also have some resources like inference profiles in there. And with that, thank you so much for joining us today. I can't wait to see what you do with the Cost and Usage Report. Thank you so much.
; This article is entirely auto-generated using Amazon Bedrock.