Martin Nanchev for AWS Community Builders

Forecast AWS charges using the Cost and Usage Report, AWS Glue DataBrew and Amazon Forecast

Ever since I started working with AWS, I have wondered whether my cost estimations are correct. One problem is that cost estimation depends heavily on assumptions (such as no seasonality), which are not always valid. AWS added the possibility to forecast monthly costs using Cost Explorer. The problem is that sometimes you want to estimate the costs for a specific service, and when you try that in AWS Cost Explorer you get:

You cannot group forecasted costs by service. This means that you cannot forecast the costs for EC2 after just a few days of usage.

In such cases there is another option: use the Cost and Usage Report together with AWS Glue DataBrew and Amazon Forecast to forecast expenses for a specific service, based on 1 to n historical values at daily or hourly granularity. Below is a proposed architecture:

Cost forecast using the Cost and Usage Report, AWS Glue DataBrew and Amazon Forecast

The architecture needs two buckets: one for storing the Cost and Usage Report and one for storing the forecast output. The Cost and Usage Report itself also has to be configured. Below is a sample configuration:

// Creation of the DataBrew role used by Forecast and DataBrew
const dataBrewRole = new Role(this, 'costAndUsageReportRole', {
  roleName: 'dataBrewServiceRole',
  assumedBy: new CompositePrincipal(
    new ServicePrincipal('databrew.amazonaws.com'),
    new ServicePrincipal('forecast.amazonaws.com'),
  ),
  path: '/service-role/',
});
// Create a bucket to store the cost and usage report, with AWS-managed encryption and versioning
const reportBucket = new Bucket(this, 'costAndUsageReportBucket', {
  encryption: BucketEncryption.S3_MANAGED,
  bucketName: 'cost-and-usage-report-2021-12-12',
  versioned: true,
  autoDeleteObjects: true,
  removalPolicy: RemovalPolicy.DESTROY,
});
// Add permissions for billingreports to put the cost and usage report and for DataBrew to get the report
// and transform the data
reportBucket.addToResourcePolicy(
  new PolicyStatement({
    resources: [reportBucket.arnForObjects('*'), reportBucket.bucketArn],
    actions: ['s3:GetBucketAcl', 's3:GetBucketPolicy', 's3:PutObject', 's3:GetObject'],
    principals: [
      new ServicePrincipal('billingreports.amazonaws.com'),
      new ServicePrincipal('databrew.amazonaws.com'),
      new AccountPrincipal(this.account),
    ],
  }),
);
// Deploy a sample cost and usage report to use for testing
const prefixCreation = new BucketDeployment(this, 'PrefixCreator', {
  sources: [Source.asset('./assets')],
  destinationBucket: reportBucket,
  destinationKeyPrefix: `2021`, // optional prefix in destination bucket
});
// Add a dependency so the file is put after the report bucket is created
prefixCreation.node.addDependency(reportBucket);
// Create the cost and usage report.
// We use Parquet because it is highly optimized and offers
// a good trade-off between speed and storage.
// A new report version will be created for each day.
// An alternative is OVERWRITE_REPORT, because it saves storage and we
// already have versioning enabled. The problem is that these files will
// grow bigger each day, so I would suggest creating a new file for each new year.
// An S3 lifecycle policy would also be a good idea.
new CfnReportDefinition(this, 'costAndUsageReport', {
  compression: 'Parquet',
  format: 'Parquet',
  refreshClosedReports: true,
  reportName: 'cost-and-usage-report-2021-12-12',
  reportVersioning: 'CREATE_NEW_REPORT',
  s3Bucket: 'cost-and-usage-report-2021-12-12',
  s3Prefix: '2021',
  s3Region: 'us-east-1',
  timeUnit: 'HOURLY',
}).addDependsOn(
  reportBucket.node.defaultChild as CfnBucket,
);
// We grant dataBrewRole read and write permissions to both buckets
outputBucket.grantReadWrite(dataBrewRole);
reportBucket.grantReadWrite(dataBrewRole);
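
The snippet above grants permissions to an outputBucket that is not defined in it. A minimal sketch of how that bucket could be defined earlier in the same stack, before the grant calls (the bucket name is illustrative, not taken from the original source):

// Assumed definition of the forecast output bucket referenced above (name is illustrative)
const outputBucket = new Bucket(this, 'forecastOutputBucket', {
  encryption: BucketEncryption.S3_MANAGED,
  bucketName: 'cost-and-usage-report-forecast-output-2021-12-12',
  versioned: true,
  autoDeleteObjects: true,
  removalPolicy: RemovalPolicy.DESTROY,
});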

After the report and the required buckets are in place, we need to create a dataset in Glue DataBrew and a recipe, which transforms the dataset using discrete transformation steps. The glue between the dataset and the recipe is the DataBrew project, which connects the two. Once the project is available, we can schedule a daily job that transforms and cleans the cost and usage report so it is ready for Amazon Forecast:
// We create the dataset, which reads the Parquet files under the 2021 bucket prefix
const cfnDataset = new CfnDataset(this, 'Dataset', {
  name: 'cost-and-usage-report-dataset',
  input: {
    s3InputDefinition: {
      bucket: `cost-and-usage-report-2021-12-12`,
      key: `2021/<[^/]+>.parquet`,
    },
  },
  format: 'PARQUET',
});
// The recipe groups the costs by service and account id and sums them up.
// As the next step it converts the date to the format required by Amazon Forecast by creating a new column for it.
// As the last step it removes the redundant information by deleting the original, untransformed date column.
const recipe = new CfnRecipe(this, 'dataBrewRecipe', {
  name: 'cost-and-usage-report-recipe',
  steps: [
    {
      action: {
        operation: 'GROUP_BY',
        parameters: {
          groupByAggFunctionOptions:
            '[{"sourceColumnName":"line_item_unblended_cost","targetColumnName":"line_item_unblended_cost_sum","targetColumnDataType":"double","functionName":"SUM"}]',
          sourceColumns: '["line_item_usage_start_date","product_product_name","line_item_usage_account_id"]',
          useNewDataFrame: 'true',
        },
      },
    },
    {
      action: {
        operation: 'DATE_FORMAT',
        parameters: {
          dateTimeFormat: 'yyyy-mm-dd',
          functionStepType: 'DATE_FORMAT',
          sourceColumn: 'line_item_usage_start_date',
          targetColumn: 'line_item_usage_start_date_DATEFORMAT',
        },
      },
    },
    {
      action: {
        operation: 'DELETE',
        parameters: {
          sourceColumns: '["line_item_usage_start_date"]',
        },
      },
    },
  ],
});
// The recipe depends on the presence of the cost and usage report in S3
recipe.node.addDependency(prefixCreation);
const cfnProject = new CfnProject(this, 'dataBrewProject', {
  datasetName: 'cost-and-usage-report-dataset',
  name: `cost-and-usage-report-forecasting-project`,
  recipeName: `cost-and-usage-report-recipe`,
  roleArn: `arn:aws:iam::${this.account}:role/service-role/dataBrewServiceRole`,
});
cfnProject.addDependsOn(recipe);
cfnProject.addDependsOn(cfnDataset);
// After the recipe, project and dataset are created, we need to publish the recipe
// using a custom resource, which implements the onUpdate and onDelete lifecycles
const publishRecipe = new AwsCustomResource(this, `publishRecipe`, {
  onUpdate: {
    service: 'DataBrew',
    action: 'publishRecipe',
    parameters: {
      Name: recipe.name,
    },
    physicalResourceId: { id: `publishRecipe` },
  },
  onDelete: {
    service: 'DataBrew',
    action: 'deleteRecipeVersion',
    parameters: {
      Name: `${recipe.name}` /* required */,
      RecipeVersion: '1.0',
    },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
});
publishRecipe.node.addDependency(recipe);
// The last step is to create a scheduled job, which executes the project (the recipe on the dataset)
const cfnJob = new CfnJob(this, 'dataBrewRecipeJob', {
  type: 'RECIPE',
  projectName: 'cost-and-usage-report-forecasting-project',
  name: `cost-and-usage-report-job`,
  outputs: [
    {
      //compressionFormat: "GZIP",
      format: 'CSV',
      location: {
        bucket: outputBucket.bucketName,
        key: `cost-and-usage-report-output`,
      },
      overwrite: true,
    },
  ],
  roleArn: dataBrewRole.roleArn,
});
cfnJob.addDependsOn(cfnProject);
// Job schedule
new CfnSchedule(this, 'dataBrewJobSchedule', {
  cronExpression: 'Cron(0 23 * * ? *)',
  name: `cost-and-usage-report-job-schedule`,
  jobNames: [`cost-and-usage-report-job`],
}).addDependsOn(cfnJob);
// Start the DataBrew job once before the schedule kicks in
const startDataBrewJob = new AwsCustomResource(this, `startDataBrewJob`, {
  onUpdate: {
    service: 'DataBrew',
    action: 'startJobRun',
    parameters: {
      Name: `cost-and-usage-report-job`,
    },
    physicalResourceId: { id: `startDataBrewJob` },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
});
startDataBrewJob.node.addDependency(cfnJob);

AWS Glue DataBrew looks like an Excel macro on steroids: it automates the transformation and cleaning of large datasets. Example:

DataBrew project view

The results of the transformation job are saved to the output bucket as CSV and serve as input for the forecast. The data from the Parquet files is divided into multiple parts:

CSV results after transformation of the parquet files

With an S3 Select query we can check the columns and values in one of the CSV objects:
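
As a reference, here is a rough sketch of such a check using the AWS SDK for JavaScript v3; the bucket name and object key are placeholders for the output bucket and one of the CSV part files, not values taken from the stack above:

// Sketch only: preview a few rows of one CSV part file with S3 Select (SDK v3).
// The bucket name and key below are placeholders.
import { S3Client, SelectObjectContentCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

async function previewCsvPart(): Promise<void> {
  const { Payload } = await s3.send(new SelectObjectContentCommand({
    Bucket: 'forecast-output-bucket-example',            // placeholder
    Key: 'cost-and-usage-report-output/part00000.csv',   // placeholder
    ExpressionType: 'SQL',
    Expression: 'SELECT * FROM s3object s LIMIT 5',
    InputSerialization: { CSV: { FileHeaderInfo: 'USE' } },
    OutputSerialization: { CSV: {} },
  }));
  if (!Payload) return;
  // The response payload is an event stream; the Records events carry the CSV bytes
  for await (const event of Payload) {
    if (event.Records?.Payload) {
      process.stdout.write(Buffer.from(event.Records.Payload).toString());
    }
  }
}

previewCsvPart().catch(console.error);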

The Amazon Forecast resources are created using custom resources, because there are no Forecast constructs in the CDK:

// First we create the Forecast dataset with a daily data frequency.
// We use time series AutoML; the target column is the cost.
const forecastDataset = new AwsCustomResource(this, `forecastDataset`, {
  onUpdate: {
    service: 'ForecastService',
    action: 'createDataset',
    parameters: {
      Domain: 'CUSTOM',
      DatasetName: 'amazonForecastDataset',
      DataFrequency: 'D',
      Schema: {
        Attributes: [
          {
            AttributeName: 'timestamp',
            AttributeType: 'timestamp',
          },
          {
            AttributeName: 'item_id',
            AttributeType: 'string',
          },
          {
            AttributeName: 'account_id',
            AttributeType: 'string',
          },
          {
            AttributeName: 'target_value',
            AttributeType: 'float',
          },
        ],
      },
      DatasetType: 'TARGET_TIME_SERIES',
    },
    physicalResourceId: { id: `forecastDataset` },
  },
  onDelete: {
    service: 'ForecastService',
    action: 'deleteDataset',
    parameters: {
      DatasetArn: `arn:aws:forecast:${this.region}:${this.account}:dataset/amazonForecastDataset`,
    },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
});
// We create a dataset group from the dataset
const forecastDatasetGroup = new AwsCustomResource(this, `forecastDatasetGroup`, {
  onUpdate: {
    service: 'ForecastService',
    action: 'createDatasetGroup',
    parameters: {
      DatasetGroupName: 'amazonForecastDatasetGroup',
      Domain: 'CUSTOM',
      DatasetArns: [`arn:aws:forecast:${this.region}:${this.account}:dataset/amazonForecastDataset`],
    },
    physicalResourceId: { id: `forecastDatasetGroup` },
  },
  onDelete: {
    service: 'ForecastService',
    action: 'deleteDatasetGroup',
    parameters: {
      DatasetGroupArn: `arn:aws:forecast:${this.region}:${this.account}:dataset-group/amazonForecastDatasetGroup` /* required */,
    },
  },
  policy: AwsCustomResourcePolicy.fromStatements([
    new PolicyStatement({
      actions: [
        'forecast:CreateDatasetGroup',
        'forecast:DeleteDatasetGroup',
        'logs:CreateLogGroup',
        'logs:CreateLogStream',
        'logs:PutLogEvents',
        'databrew:StartJobRun',
        'iam:PassRole',
      ],
      resources: ['*'],
    }),
  ]),
});
forecastDatasetGroup.node.addDependency(forecastDataset);
// Here we import the CSV dataset from S3. This can take up to 40 minutes
const datasetImportJob = new AwsCustomResource(this, `forecastDatasetImportJob`, {
  onUpdate: {
    service: 'ForecastService',
    action: 'createDatasetImportJob',
    parameters: {
      DataSource: {
        S3Config: {
          Path: `s3://${outputBucket.bucketName}/${ForecastingProperties.PREFIX}-output`,
          RoleArn: `${dataBrewRole.roleArn}`,
        },
      },
      DatasetImportJobName: 'amazonForecastDatasetImportJob',
      TimestampFormat: 'yyyy-MM-dd',
      DatasetArn: forecastDataset.getResponseField('DatasetArn'),
    },
    physicalResourceId: { id: `forecastDatasetImportJob` },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
});
datasetImportJob.node.addDependency(forecastDatasetGroup);
// Last we train the model (predictor), which can take up to 2 hours
new AwsCustomResource(this, `forecastPredictor`, {
  onUpdate: {
    service: 'ForecastService',
    action: 'createPredictor',
    parameters: {
      PredictorName: `costAndUsageReportTrainPredictor`,
      ForecastHorizon: 7,
      FeaturizationConfig: {
        ForecastFrequency: 'D',
        ForecastDimensions: ['account_id'],
      },
      PerformAutoML: true,
      InputDataConfig: {
        DatasetGroupArn: `arn:aws:forecast:${this.region}:${this.account}:dataset-group/amazonForecastDatasetGroup`,
      },
    },
    physicalResourceId: { id: `forecastPredictor` },
  },
  onDelete: {
    service: 'ForecastService',
    action: 'deletePredictor',
    parameters: {
      PredictorArn: `arn:aws:forecast:${this.region}:${this.account}:predictor/costAndUsageReportTrainPredictor`,
    },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
}).node.addDependency(forecastDatasetGroup);

We create a dataset from the CSV files in the S3 output bucket. A dataset group is a container for datasets. After that we import the data into the dataset group, which takes about 40 minutes. The last step is to train the time series model (predictor), which is done with AutoML. AutoML selects the algorithm best suited to the dataset, for example DeepAR+. The forecast horizon is 7 days, but it could be longer if you have more data. Training the predictor takes about 2 hours, and generating the forecast about 40 minutes. Finally we can create a forecast; the results include the 0.1, 0.5 and 0.9 quantiles. Below is an example:
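
Creating the forecast itself is not shown in the snippets above. A minimal sketch of what that last custom resource could look like, requesting the 0.1, 0.5 and 0.9 quantiles (the forecast name is illustrative, and the call will only succeed once the predictor has finished training):

// Sketch only: create the actual forecast from the trained predictor.
// The forecast name is illustrative; the predictor ARN matches the one created above.
const createForecast = new AwsCustomResource(this, `forecastCreation`, {
  onUpdate: {
    service: 'ForecastService',
    action: 'createForecast',
    parameters: {
      ForecastName: 'costAndUsageReportForecast',
      PredictorArn: `arn:aws:forecast:${this.region}:${this.account}:predictor/costAndUsageReportTrainPredictor`,
      ForecastTypes: ['0.1', '0.5', '0.9'],
    },
    physicalResourceId: { id: `forecastCreation` },
  },
  onDelete: {
    service: 'ForecastService',
    action: 'deleteForecast',
    parameters: {
      ForecastArn: `arn:aws:forecast:${this.region}:${this.account}:forecast/costAndUsageReportForecast`,
    },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE }),
});

In practice you would only deploy this after the predictor is active, or add a wait; otherwise the console workflow shown below achieves the same result.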

We select a start and end date and get the estimation

One important note that was not mentioned: the model above was using costs per hour, but the same is possible for costs per day, which is what you would normally use in production. This is why the costs for DocumentDB drop after midnight on 24.10.2021 and amount to about $6 for the day. This means the monthly costs will be around $180, with some degree of certainty.

You can also do this for different accounts by specifying the account ID:
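
The same lookup can also be done programmatically. Here is a rough sketch using the AWS SDK for JavaScript v3 forecastquery client; the forecast ARN, account ID and service name are placeholders, not values from the deployment above:

// Sketch only: query the generated forecast for one service in one account.
// The forecast ARN, account ID and item_id value are placeholders.
import { ForecastqueryClient, QueryForecastCommand } from '@aws-sdk/client-forecastquery';

const forecastQuery = new ForecastqueryClient({ region: 'us-east-1' });

async function queryCosts(): Promise<void> {
  const { Forecast } = await forecastQuery.send(new QueryForecastCommand({
    ForecastArn: 'arn:aws:forecast:us-east-1:123456789012:forecast/costAndUsageReportForecast',
    Filters: {
      item_id: 'Amazon DocumentDB (with MongoDB compatibility)',
      account_id: '123456789012',
    },
  }));
  // Predictions is a map keyed by quantile, e.g. p10, p50 and p90
  console.log(JSON.stringify(Forecast?.Predictions, null, 2));
}

queryCosts().catch(console.error);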

Summary: If you want more granular forecasting of your costs, hourly or daily and broken down by service, then AWS Glue DataBrew and Amazon Forecast will do the job. I would suggest using a daily rather than an hourly forecast, but this article serves just as an example and an overview of what these services can do. The source code is available below.

Sources/Source code:
GitHub - mnanchev/aws_cdk_forecast_cost_and_usage: Forecasting costs using costs and usage report
Forecasting AWS spend using the AWS Cost and Usage Reports, AWS Glue DataBrew, and Amazon Forecast…
