Jinwook Baek

Export AWS/GCP cost and billing data to BigQuery for analytics

Original post has better layout with images.


Intro

I am currently using both GCP and AWS for a certain project. While each cloud provider offers nice tools and visualizations for its own spending (Cloud Billing for GCP and Cost & Usage Reports for AWS, respectively), I wanted to consolidate the usage reports from both providers and visualize them with a single BI tool. The report does not need to be real-time; daily granularity is enough. (No streaming.)

So I had to spin something up, and these were a couple of options:

  • House data in AWS using Athena + QuickSight (or another BI tool)
  • House data in GCP using BigQuery + Data Studio (or another BI tool)
  • Spend more time searching for third-party tools

I am a big fan of BigQuery and I prefer to do ELT rather than ETL on BQ. Moreover, since I was already using BigQuery for the project with holistics.io, I chose to house the usage and cost data in BigQuery.

Move data to BQ

GCP Cloud Billing

Setup

This is easy; Cloud Billing natively supports BigQuery export.

  • Enable billing for the GCP project.

    I assume that you already have a billing account set up. (You wouldn't need cost analytics unless you are spending money.) But if you need to enable billing for the project, refer to the following link.

    Modify a project's billing settings | Cloud Billing | Google Cloud

  • Create a dataset in BigQuery within the GCP project you desire - I named it billing_export. (If you prefer SQL, a DDL sketch follows after this list.)

    Creating datasets | BigQuery | Google Cloud

  • You need appropriate permissions to set up the export:

    • Billing Account Administrator
    • BigQuery Admin for the project
    • resourcemanager.projects.update permission


  • Enable Cloud Billing export to BigQuery dataset
    • Go to Billing menu in Navigation menu (https://console.cloud.google.com/billing/)
    • Go to linked billing account (this should be already set, if not you need to add payment info)
    • Select Billing export → Select BigQuery export tab
    • Select Edit settings on Daily cost detail
      • Select the project and dataset where you want the data to land. (Note: the BigQuery API is required to export data to BigQuery. If the project you selected doesn't have the BigQuery API enabled, you will be prompted to enable it; click Enable BigQuery API and the API will be enabled for you.)
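
If you prefer SQL to the console for the dataset step above, here is a minimal DDL sketch. The dataset name matches the one used in this post; the location is an assumption, so pick whichever region fits your project.

-- Sketch: create the dataset that the billing export (and later the AWS table) will live in.
CREATE SCHEMA `<project>.billing_export`
OPTIONS (location = 'US');  -- assumed location; use your own region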

Once everything is set, you will see the export configuration on the Billing export screen.

When you first enable the daily cost detail export to BigQuery, it might take a few hours before your Google Cloud cost data starts to appear. A table named gcp_billing_export_v1_<some_hash> will be generated, automatically partitioned by day (you can query specific partitions to save cost).


Schema

After a while you will notice data being exported to the table. You can click the table and then preview or query it.



SELECT usage_start_time, location.region, cost, currency, usage.amount, usage.unit, credits
FROM `<project>.billing_export.gcp_billing_export_v1_<hash>`
WHERE _PARTITIONTIME >= "2020-12-01 00:00:00"




If you need specific information about each column, refer to the following link. It will help you understand each column of the data.

Understanding the Cloud Billing data tables in BigQuery | Google Cloud

Take special note of the credits nested field if you are using the free tier.
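
If you want to see the effect of those credits, here is a sketch of a query that computes gross cost, total credits, and net cost per day and service. It follows a pattern similar to Google's example queries and assumes the table name from the query above.

-- Net cost per day and service: credit amounts are negative, so gross + credits = net.
SELECT
  DATE(usage_start_time) AS usage_date,
  service.description AS service_name,
  SUM(cost) AS gross_cost,
  SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0)) AS total_credits,
  SUM(cost) + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0)) AS net_cost
FROM `<project>.billing_export.gcp_billing_export_v1_<hash>`
WHERE _PARTITIONTIME >= "2020-12-01 00:00:00"
GROUP BY usage_date, service_name
ORDER BY usage_date, service_name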

That's it for GCP! Now on to the harder part.

AWS Cost & Usage Report

I didn't want to reinvent the wheel, so, as all good developers do, I googled. I found a couple of great articles on this subject, but they seemed complicated and outdated. Luckily, there is already a natively supported S3 transfer service in BigQuery (the BigQuery Data Transfer Service). It supports transferring CSV and Parquet files.

(Architecture diagram from one of those older articles - it seems too complicated.)

Amazon S3 transfers | BigQuery Data Transfer Service | Google Cloud

Let's save the Cost and Usage Report to S3 first. Go to AWS Cost and Usage Reports in the Billing console and create a report there.

AWS Cost and Usage Reports


  • Report name - daily_report_gzip
  • Include resource IDs - check
  • Data refresh settings - uncheck
  • S3 bucket - create a bucket or select a bucket
  • prefix - billing/daily_report_gzip
  • Time granularity - Daily
  • Report versioning - overwrite
  • Compression type - gzip

After creating the report, it takes a couple of hours for the data to show up in S3. Once the report is generated, you will see a gz file exported under the following URI:

s3://<bucket>/billing/daily_report_gzip/20201201-20210101/

  • data - daily_report_gzip-00001.csv.gz
  • metadata - daily_report_gzip-Manifest.json


Schema

We need to know the schema in order to house the data in a BigQuery table. If you want to know what each field represents, consult the data dictionary.

Data dictionary

When you open up the daily_report_gzip-Manifest.json file, you will discover how each column is structured. Take note of the variety of types - String, OptionalBigDecimal, DateInterval and so on.

  • JSON - manifest file example: each entry in the columns array has a category, a name, and a type.

Converting the manifest to a BQ table schema

In order to transfer the data, the BQ table needs to have the correct schema to house the CSV data. BQ has different types, and there are too many fields to copy and paste by hand. You can configure a table schema with Edit as text (e.g. Field1:TYPE,Field2:TYPE). Let's use jq to extract the columns as a BQ table schema.

I also need to map some of the field types to BigQuery-compatible types:

  • BigDecimal, OptionalBigDecimal → BigNumeric
  • Interval → String
  • DateTime → Timestamp
  • OptionalString → String

Download the JSON file and pipe it through the following jq command:



cat daily_report_gzip-Manifest.json | jq -jr '.columns[] | .category+"_"+.name+":"+ .type +","' | sed s/.$// | sed 's/OptionalString/String/g' | sed 's/\<Interval\>/String/g' | sed 's/OptionalBigDecimal/BigNumeric/g' | sed 's/BigDecimal/BigNumeric/g' | sed 's/DateTime/Timestamp/g'


  • jq -jr '.columns[] | .category+"_"+.name+":"+ .type +","'
    • this extracts each item of the columns array as <category>_<name>:<type>. I prepended the category because there are duplicate names across categories
    • -jr for joined, raw output
  • sed s/.$//
    • removes the trailing ,
  • sed 's/<original type>/<bq type>/g'
    • replaces the manifest types with BQ-compatible types

This command results in the following text:



identity_LineItemId:String,identity_TimeInterval:String,bill_InvoiceId:String,bill_BillingEntity:String,bill_BillType:String,bill_PayerAccountId:String,bill_BillingPeriodStartDate:Timestamp,bill_BillingPeriodEndDate:Timestamp,lineItem_UsageAccountId:String,lineItem_LineItemType:String,lineItem_UsageStartDate:Timestamp,lineItem_UsageEndDate:Timestamp,lineItem_ProductCode:String,lineItem_UsageType:String,lineItem_Operation:String,lineItem_AvailabilityZone:String,lineItem_ResourceId:String,lineItem_UsageAmount:BigNumeric,lineItem_NormalizationFactor:BigNumeric,lineItem_NormalizedUsageAmount:BigNumeric,lineItem_CurrencyCode:String,lineItem_UnblendedRate:String,lineItem_UnblendedCost:BigNumeric,lineItem_BlendedRate:String,lineItem_BlendedCost:BigNumeric,lineItem_LineItemDescription:String,lineItem_TaxType:String,lineItem_LegalEntity:String,product_ProductName:String,product_accountAssistance:String,product_alarmType:String,product_architecturalReview:String,product_architectureSupport:String,product_availability:String,product_baseProductReferenceCode:String,product_bestPractices:String,product_cacheEngine:String,product_capacitystatus:String,product_caseSeverityresponseTimes:String,product_clientLocation:String,product_clockSpeed:String,product_computeFamily:String,product_computeType:String,product_cputype:String,product_currentGeneration:String,product_customerServiceAndCommunities:String,product_databaseEngine:String,product_deploymentOption:String,product_description:String,product_dominantnondominant:String,product_durability:String,product_ecu:String,product_endpointType:String,product_engineCode:String,product_enhancedNetworkingSupported:String,product_eventType:String,product_feeDescription:String,product_fromLocation:String,product_fromLocationType:String,product_group:String,product_groupDescription:String,product_includedServices:String,product_insightstype:String,product_instanceFamily:String,product_instanceType:String,product_instanceTypeFamily:String,product_intelAvx2Available:String,product_intelAvxAvailable:String,product_intelTurboAvailable:String,product_launchSupport:String,product_licenseModel:String,product_location:String,product_locationType:String,product_logsDestination:String,product_maxIopsBurstPerformance:String,product_maxIopsvolume:String,product_maxThroughputvolume:String,product_maxVolumeSize:String,product_memory:String,product_memoryGib:String,product_memorytype:String,product_messageDeliveryFrequency:String,product_messageDeliveryOrder:String,product_minVolumeSize:String,product_networkPerformance:String,product_normalizationSizeFactor:String,product_operatingSystem:String,product_operation:String,product_operationsSupport:String,product_physicalProcessor:String,product_preInstalledSw:String,product_pricingUnit:String,product_proactiveGuidance:String,product_processorArchitecture:String,product_processorFeatures:String,product_productFamily:String,product_programmaticCaseManagement:String,product_queueType:String,product_region:String,product_requestDescription:String,product_requestType:String,product_resourcePriceGroup:String,product_routingTarget:String,product_routingType:String,product_servicecode:String,product_servicename:String,product_sku:String,product_storage:String,product_storageClass:String,product_storageMedia:String,product_storageType:String,product_technicalSupport:String,product_tenancy:String,product_thirdpartySoftwareSupport:String,product_tiertype:String,product_toLocation:String,product_toLocationType:String,product_trafficDirection:String,product_training:String,product_transferType:String,product_usagetype:String,product_vcpu:String,product_version:String,product_volumeApiName:String,product_volumeType:String,product_whoCanOpenCases:String,pricing_LeaseContractLength:String,pricing_OfferingClass:String,pricing_PurchaseOption:String,pricing_RateId:String,pricing_currency:String,pricing_publicOnDemandCost:BigNumeric,pricing_publicOnDemandRate:String,pricing_term:String,pricing_unit:String,reservation_AmortizedUpfrontCostForUsage:BigNumeric,reservation_AmortizedUpfrontFeeForBillingPeriod:BigNumeric,reservation_EffectiveCost:BigNumeric,reservation_EndTime:String,reservation_ModificationStatus:String,reservation_NormalizedUnitsPerReservation:String,reservation_NumberOfReservations:String,reservation_RecurringFeeForUsage:BigNumeric,reservation_ReservationARN:String,reservation_StartTime:String,reservation_SubscriptionId:String,reservation_TotalReservedNormalizedUnits:String,reservation_TotalReservedUnits:String,reservation_UnitsPerReservation:String,reservation_UnusedAmortizedUpfrontFeeForBillingPeriod:BigNumeric,reservation_UnusedNormalizedUnitQuantity:BigNumeric,reservation_UnusedQuantity:BigNumeric,reservation_UnusedRecurringFee:BigNumeric,reservation_UpfrontValue:BigNumeric,savingsPlan_TotalCommitmentToDate:BigNumeric,savingsPlan_SavingsPlanARN:String,savingsPlan_SavingsPlanRate:BigNumeric,savingsPlan_UsedCommitment:BigNumeric,savingsPlan_SavingsPlanEffectiveCost:BigNumeric,savingsPlan_AmortizedUpfrontCommitmentForBillingPeriod:BigNumeric,savingsPlan_RecurringCommitmentForBillingPeriod:BigNumeric





Create BQ Table

Let's create a BQ table. You should already have a dataset if you set up the Cloud Billing export; if not, create one now.

Once the dataset is created, click Create table.

  • Create table from - Empty table
  • Destination - search for a project
  • Project name - the project where you created the dataset
  • Dataset name - select the one you created
  • Table name - aws
  • For Schema, toggle Edit as text and paste the schema you generated with the jq command
  • Partitioning - you can skip this, or select a timestamp field that you will commonly filter on. I chose bill_BillingPeriodStartDate and partitioned by day. (An equivalent DDL sketch follows below.)

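If you prefer SQL to clicking through the console, here is a minimal DDL sketch of the same table. Only a handful of columns are shown for illustration - in practice you would paste the full column list generated by the jq command - and the project and dataset names are the placeholders used earlier in this post.

-- Sketch: AWS CUR table partitioned by day on the billing period start date.
-- Truncated column list; include every column from the generated schema.
CREATE TABLE `<project>.billing_export.aws`
(
  identity_LineItemId STRING,
  bill_BillingPeriodStartDate TIMESTAMP,
  bill_BillingPeriodEndDate TIMESTAMP,
  lineItem_UsageStartDate TIMESTAMP,
  lineItem_ProductCode STRING,
  lineItem_UsageAmount BIGNUMERIC,
  lineItem_UnblendedCost BIGNUMERIC,
  lineItem_CurrencyCode STRING
)
PARTITION BY DATE(bill_BillingPeriodStartDate);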

Transfer Prereq

Once the table is ready, let's create a transfer.

You need the following permissions first:

  • bigquery.transfers.update
  • bigquery.datasets.update

These permissions are included in the predefined bigquery.admin role.

AWS credential

Create an IAM user and access key in AWS for the transfer. You only need read-only access to the specific bucket you created, but if you are lazy, just attach the AmazonS3ReadOnlyAccess AWS managed policy, which grants read access to all S3 resources. If you don't know how to create an IAM user, refer to the following link.

Amazon S3 - Creating an IAM user

Configure Transfer

In the BigQuery console, click the Transfers menu in the left pane. You will be redirected to the transfer configuration page. Click the + Create Transfer button.

  • Source - Amazon S3
  • Display Name - aws-billing-export
  • Schedule options - starts now, daily
  • Destination

    • Dataset ID - dataset you created
    • Destination table - aws
    • Amazon S3 URI - s3://<bucket>/billing/daily_report_gzip/*/*.gz
    • Access key ID
    • Secret access key
    • File format - csv
      • BQ supports gzip-compressed CSV natively, so there is no need to decompress first
    • Transfer option
      • number of errors allowed - 0
      • ignore unknown values - check
      • field delimiter - ,
      • Header rows to skip - 1
    • Notification
      • you can create a new Pub/Sub topic and configure notifications here, but this can also be configured later, so leave it blank for now.

    Click Save!


You will see a green check mark if the transfer succeeds.


If you see an error with a red marker, examine the error logs in the right pane. Common errors include:

  • Wrong S3 URI
  • Wrong AWS credentials
  • Wrong table schema
  • No objects
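
Once a run succeeds, a quick sanity check helps confirm that rows actually landed and that the totals look plausible. A sketch, using the table created above:

-- Line-item count and unblended cost per billing period.
SELECT
  bill_BillingPeriodStartDate,
  COUNT(*) AS line_items,
  SUM(lineItem_UnblendedCost) AS total_unblended_cost
FROM `<project>.billing_export.aws`
GROUP BY bill_BillingPeriodStartDate
ORDER BY bill_BillingPeriodStartDate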

Cleaning Data

You now have two tables populated with billing data! It's time to do some analytics and visualization.

Let's review the fields:

AWS

Data dictionary

GCP

Understanding the Cloud Billing data tables in BigQuery | Google Cloud

You can use the BQ console to query and test the data for integrity, to make sure the load is intact. I won't go into detail on the SQL queries.

Once the data is extracted and loaded, I do the transform step in Holistics.io, leaving the raw data intact, but you can also create a view to minimize, clean, and consolidate the data into a single table, or use BigQuery's scheduled query feature.
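
As a sketch of that consolidation step - assuming the dataset and table names used earlier, and rolling both providers up to daily granularity - a view like the following unions the two exports into a single date / provider / service / cost table (the daily_cost view name is just for illustration):

-- Sketch: one view with daily cost per provider and service.
CREATE OR REPLACE VIEW `<project>.billing_export.daily_cost` AS
SELECT
  DATE(usage_start_time) AS usage_date,
  'gcp' AS provider,
  service.description AS service_name,
  SUM(cost) AS cost
FROM `<project>.billing_export.gcp_billing_export_v1_<hash>`
GROUP BY usage_date, service_name
UNION ALL
SELECT
  DATE(lineItem_UsageStartDate) AS usage_date,
  'aws' AS provider,
  lineItem_ProductCode AS service_name,
  CAST(SUM(lineItem_UnblendedCost) AS FLOAT64) AS cost
FROM `<project>.billing_export.aws`
GROUP BY usage_date, service_name;

One thing to watch here is currency: the GCP cost column is in the billing account's currency and the AWS cost is in lineItem_CurrencyCode, so convert one side if they differ.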

Visualize data

After inspecting the data with a couple of queries - I am just a newbie in the SQL world - I used Data Studio to explore it further.

When you use BigQuery, you can connect the data to Data Studio without extra configuration.

You can also use other BI tools that connect to BQ to visualize and analyze your data from here.

Extra

  • [ ] Check data integrity with daily and monthly roll over

Conclusion

Since BQ supports S3 transfers, it was a lot easier for me to move the data around, but there was still some work to be done.

Please let me know if you find a more complete solution for consolidating billing data in a single place!

Thank you for reading - here are some pictures of Kokos!

(pictures of Kokos)

Reference

Export Cloud Billing data to BigQuery | Google Cloud

Overview of Amazon S3 transfers | BigQuery Data Transfer Service

Using runtime parameters in transfers | BigQuery Data Transfer Service

Top comments (5)

Tim Cutts

This is a great post. I too had previously found the very complex AppEngine workflow, but this version is much simpler. I chose to use Parquet as the export format, but I don't think that makes much difference. What I have observed, though, is that I end up with duplicated AWS billing data the next day; the AWS billing export is updated every day in the S3 bucket, and if I run the BigQuery transfer again, I end up with doubled costs because of duplicated and changed rows. How did you get around this problem? Just de-duplicating the data naively with SELECT DISTINCT * doesn't quite work either, because some rows aren't exactly the same, but the second day's row is a cumulative update to the previous day's row, so you still have the cost in there twice for long-running resources. Did you find a solution to this issue?

mounika-uni

Did you find a solution for this by any chance?

Tim Cutts • Edited

Yes - I partition the BigQuery table by day, and then delete the duplicates with a cloud function as follows. I use a BigQuery Transfer job to copy the AWS data from S3 into BigQuery once a day, and this cloud function triggers to tidy up.

// Copyright 2021 Genome Research Limited
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

'use strict';

const {BigQuery} = require('@google-cloud/bigquery')

const bq = {
    location: 'EU',
    project: "YOUR_GOOGLE_PROJECT",
    dataset: "YOUR_AWS_DATASET",
    table: "YOUR_AWS_TABLE"
}

async function runQuery() {
    const bigquery = new BigQuery()

    const query = `
    DECLARE max_date TIMESTAMP;
    DECLARE current_bill_period TIMESTAMP;
    SET max_date = (SELECT MAX(_PARTITIONTIME)
      FROM \`${bq.project}.${bq.dataset}.${bq.table}\`);
    SET current_bill_period = (SELECT MAX(bill_billing_period_start_date)
      FROM \`${bq.project}.${bq.dataset}.${bq.table}\`
      WHERE _PARTITIONTIME = max_date);

    DELETE FROM \`${bq.project}.${bq.dataset}.${bq.table}\`
    WHERE
      bill_billing_period_start_date = current_bill_period
      AND _PARTITIONTIME < max_date;
    `
    const options = {
        query: query,
        location: bq.location
    }

    const [job] = await bigquery.createQueryJob(options);
    console.log(`Bigquery delete Job ${job.id} started`)
    const [rows] = await job.getQueryResults()
    console.log("Results:")
    rows.forEach(row => console.log(row))
}

exports.deleteAWSDupBill = (message, context) => {

  const data = JSON.parse(Buffer.from(message.data, 'base64').toString());

  if (data.state === "SUCCEEDED" &&
      data.destinationDatasetId === bq.dataset &&
      data.params.destination_table_name_template === bq.table) {
      return runQuery() // return the promise so the function isn't terminated before the query finishes
  }
};
AWSGeekCoder

In case you want to transfer data in the other direction, from BigQuery to S3, you can follow this video, as it explains each step in detail.

[Transfer data from Google BigQuery to AWS S3 In Under 10 Minutes] - youtu.be/whHpS4mbwdU

Lonzo Brown

I see a website that works on Medical Billing Services in New Jersey. I want to make some code - will you provide it to me?