Yogesh Sharma for AWS Community Builders

Balancing Act: Tips for Cost Optimization in AWS Data Lake Architectures

Building and maintaining a robust data lake on AWS is a crucial step towards enabling data-driven decision-making. However, managing the costs associated with various components of a data lake is equally important. In this blog post (Level 200), we will explore cost optimization strategies for key AWS data lake components. By implementing these strategies, organizations can ensure efficiency, scalability, and cost-effectiveness in their data lake operations.

Understanding AWS Data Lake Components

An AWS data lake is a centralized repository that allows organizations to store structured and unstructured data at any scale. It enables data exploration and analytics, providing a foundation for business intelligence and machine learning. AWS data lakes comprise various services like Amazon S3 (for storage), AWS Glue (for data catalog and ETL), Amazon Athena (for querying data), Amazon Redshift (for data warehousing), and Amazon EMR (for big data processing), among others.
Across both real-time and batch processing, data flows through different stages. For each stage, there is an option to use managed services from AWS or to use AWS compute, storage, or network services with third-party or open-source software installed on top of them.

(Image: data lake architecture)
In the case of managed services, AWS provides features like high availability, backups, and management of the underlying infrastructure at an additional cost. In some cases, the managed services are on-demand, serverless solutions, where the customer is charged only when the service is used.

Cost Optimization Tips

  • Choose a columnar format
    Data serialization formats like Parquet, ORC, and Avro offer efficient storage and processing, and choosing the right format can lead to real cost savings. Evaluate formats against the nature of your data and your processing workloads to select the most cost-effective option. Choosing a columnar format such as Parquet or ORC reduces disk I/O, which cuts costs and can drastically improve query performance in data lakes (Avro, being row-oriented, is better suited to write-heavy, record-at-a-time workloads).

E.g. querying the same dataset with Amazon Athena:
CSV: 102.9 GB scanned, $0.10 per query, 4 minutes 20 seconds
Parquet (partitioned): 1.04 GB scanned, $0.001 per query, 12.65 seconds
Simply switching to the Parquet format cut the query cost by roughly 99% and improved performance by roughly 95%. The exact savings depend on the data being queried, but as a general modern data stack rule, prefer a columnar data format. A minimal conversion sketch follows.
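As a minimal sketch of such a conversion (the file names and the use of pyarrow are assumptions for illustration, not part of any particular pipeline):

```python
# Convert a CSV file to Parquet using pyarrow (file names are illustrative).
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the raw CSV into an in-memory Arrow table.
table = pv.read_csv("sales.csv")

# Write it back out as Parquet with Snappy compression.
pq.write_table(table, "sales.parquet", compression="snappy")

print(f"Rows written: {table.num_rows}")
```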

  • Partition data
    Partitioning data reduces the amount of data that queries have to scan, which both reduces cost and improves performance; see the sketch after this item.
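A minimal PySpark sketch of writing partitioned data (the bucket, paths, and partition columns are assumptions for illustration):

```python
# Write a dataset partitioned by year and month so queries that filter on
# those columns only scan the matching S3 prefixes (paths are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/raw/events/")

(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("s3://my-data-lake/curated/events/"))
```

A query such as `WHERE year = 2024 AND month = 1` then reads only the files under that partition's prefix instead of the whole dataset.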

  • Use data compression
    Data compression reduces storage costs and improves data transfer efficiency. However, it's important to balance compression ratios with processing overhead.
    Use compression algorithms like Gzip, Snappy, or Zstandard to compress data before storage or transfer, considering the trade-offs between compression ratios and CPU usage.

    An 8 GB CSV dataset compressed with bzip2 can shrink to around 2 GB, a 75% reduction. Compression also reduces the amount of data scanned by compute services such as Amazon Athena; a small upload sketch follows.
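As a rough sketch of compressing a file before landing it in S3 (the bucket, key, and file names are assumptions), using only the standard library and boto3:

```python
# Compress a CSV with bzip2 before uploading, trading some CPU time for
# smaller storage and less data scanned downstream (names are illustrative).
import bz2
import shutil
import boto3

with open("events.csv", "rb") as src, bz2.open("events.csv.bz2", "wb") as dst:
    shutil.copyfileobj(src, dst)

s3 = boto3.client("s3")
s3.upload_file("events.csv.bz2", "my-data-lake", "raw/events/events.csv.bz2")
```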

  • Create data lifecycle policies
    Object lifecycle policies in S3 let you automatically transition objects to different storage classes or delete them based on defined rules. Set up lifecycle policies to move data to lower-cost storage classes (e.g., S3 Standard-IA, S3 One Zone-IA) or to delete stale data. In data lakes we generally keep a copy of the raw data in case we need to reprocess it as we find new questions to ask of our data; however, this data often isn't required during day-to-day operations. Once the raw data has been processed for the organization, we can move it to a cheaper tier of storage (i.e. Glacier); a boto3 sketch follows the example below.

    E.g. for 100 TB of data:
    Amazon S3 Standard: $2,406.40 per month
    Amazon S3 Glacier: $460.80 per month (~80% saving)

(Image: S3 lifecycle transitions)
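A minimal boto3 sketch of such a lifecycle rule (the bucket name, prefix, and transition/expiration days are assumptions; adjust them to your retention requirements):

```python
# Transition raw-zone objects to Glacier after 90 days and expire them after
# 5 years (bucket, prefix, and timings are illustrative assumptions).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```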

  • Implement Intelligent-Tiering
    S3 Intelligent-Tiering is a storage class that automatically moves objects to the most cost-effective access tier based on access patterns. Use it for objects with unpredictable access patterns to keep storage costs optimal; a boto3 sketch follows. In addition, S3 Storage Class Analysis provides insights into storage usage patterns, helping you make informed decisions about which storage class to use for different data sets. Analyze its reports to identify opportunities for optimizing storage class usage.

(Image: S3 storage classes)
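A minimal sketch of writing objects straight into Intelligent-Tiering with boto3 (the bucket and key are assumptions):

```python
# Upload an object directly into the S3 Intelligent-Tiering storage class
# (bucket and key are illustrative).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "events.parquet",
    "my-data-lake",
    "curated/events/events.parquet",
    ExtraArgs={"StorageClass": "INTELLIGENT_TIERING"},
)
```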

  • Right-size your instances
    AWS services are consumed on demand, so you pay only for what you consume. For many services, when you run a job you define how many resources you want and choose the instance size; you can also stop instances when they are not in use and spin them up on a specific schedule. E.g. selecting the appropriate instance types for EMR clusters can significantly impact costs, and the right mix of instance types optimizes performance while keeping costs in check. Use instance fleets in EMR to diversify instance types and sizes based on the specific needs of your workloads, and use transient EMR clusters for data that only has to be processed once a day or once a month. Monitor instance metrics for your analytic workloads and downsize instances that are over-provisioned (see the instance-fleet sketch after the Spot Instances tip).

(Image: EC2 right-sizing)

  • Use Spot Instances
    Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud and are available at up to a 90% discount compared to On-Demand prices. Using Spot can save an enterprise up to 40 percent over EC2 Reserved Instance pricing and up to 80 percent over On-Demand pricing. Spot is a great fit for batch workloads where time to insight is less critical.
    E.g. for an EMR cluster: On-Demand Instance: $106.56 per month; Spot Instance (estimated at a conservative 70% discount): $31.96 per month, a saving of just over 70%.
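As a rough boto3 sketch tying the right-sizing and Spot tips together (the cluster name, release label, roles, instance types, and capacities are all assumptions), a transient EMR cluster can mix an On-Demand primary node with Spot core capacity via instance fleets:

```python
# Launch a transient EMR cluster: On-Demand primary node, Spot core capacity
# via instance fleets (names, roles, and instance types are illustrative).
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-batch-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetSpotCapacity": 4,
                # Several instance types give Spot more capacity pools to draw from.
                "InstanceTypeConfigs": [
                    {"InstanceType": "m5.xlarge"},
                    {"InstanceType": "m5a.xlarge"},
                    {"InstanceType": "m4.xlarge"},
                ],
            },
        ],
        # Transient cluster: terminate automatically once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
)
```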
  • Use Reserved Instances Amazon EC2 Reserved Instances (RI) provide a significant discount (up to 72%) compared to On-Demand pricing and provide a capacity reservation when used in a specific Availability Zone. This can be useful if some services used in the data lake will be constantly utilized.
  • Choose the right tool for the job
    Picking the right tool for the job helps reduce cost. If you are pushing data into Kinesis Data Streams, you can choose between the AWS SDK, the Kinesis Producer Library (KPL), or the Kinesis Agent. Sometimes the option is dictated by the source of the data, and sometimes the developer can choose.
    Using the latest KPL lets you use its native capability to aggregate multiple messages/events into a single PUT unit (Kinesis record aggregation). Each Kinesis Data Streams shard supports up to 1,000 records per second or 1 MB/sec of throughput, so combining multiple records into a single Kinesis Data Streams record lets customers improve their per-shard throughput; a rough illustration of the idea follows this item.

    E.g. a long-running EMR cluster for a small Spark job is wasteful compared to a transient EMR cluster (or an AWS Glue job, if cold start is not an issue). Similarly, to implement data privacy transformations it is often better to use AWS Glue DataBrew, which provides more than 100 out-of-the-box transformations for the task.
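The KPL handles record aggregation natively (it is a Java/C++ library). As a rough Python illustration of the same idea with boto3 (the stream name and event shape are assumptions, and this simple newline-delimited packing is not the KPL aggregation wire format):

```python
# Pack many small JSON events into one Kinesis record to improve per-shard
# throughput. This mimics the idea behind KPL aggregation; it is NOT the KPL
# wire format (stream name and events are illustrative).
import json
import boto3

kinesis = boto3.client("kinesis")

events = [{"user_id": i, "action": "click"} for i in range(500)]

# Newline-delimited JSON keeps the blob easy to split on the consumer side.
payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")

kinesis.put_record(
    StreamName="user-activity-stream",
    Data=payload,                 # one aggregated record instead of 500 PUTs
    PartitionKey="user-activity",
)
```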

  • Use automatic scaling
    Automatic scaling is the ability to spin resources up and down based on the need. Building application elasticity enables you to incur cost only for your fully utilized resources.
    Building EC2 instances within an Auto Scaling group provides elasticity for Amazon EC2 based services where applicable. Even for managed services like Kinesis, where shards are defined per stream during provisioning, AWS Application Auto Scaling provides the ability to automatically add or remove shards based on utilization.

    For example, if you are using Kinesis streams to capture user activity for an application hosted in a specific Region, the streaming volume might vary between day and night: during the day, when user activity is higher, you need more shards than at night, when user activity is very low. Configuring AWS Application Auto Scaling based on your utilization enables you to optimize your Kinesis cost. Data flowing at 50,000 records/sec with each record at 2 KB (about 100 MB/sec) requires roughly 100 shards; if the flow drops to 1,000 records/sec for eight hours overnight, only two shards are needed. A resizing sketch follows.
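A minimal boto3 sketch of resizing a stream on a schedule (the stream name and shard counts are assumptions; in practice this logic would run from a scheduled Lambda function or be driven by AWS Application Auto Scaling):

```python
# Scale a Kinesis stream down for the low-traffic night window and back up
# for the day (stream name and shard counts are illustrative).
import boto3

kinesis = boto3.client("kinesis")

def scale_stream(target_shards: int) -> None:
    """Resize the stream; UNIFORM_SCALING splits/merges shards evenly."""
    kinesis.update_shard_count(
        StreamName="user-activity-stream",
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )

# Each call can at most double or halve the current shard count, so a large
# change (e.g. 100 -> 2 shards) must be applied in several successive steps.
scale_stream(50)   # e.g. first step of an evening scale-down
```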

  • Adopt Glue best practices
    Upgrade to the latest version- Running AWS Glue jobs on the latest version lets you take advantage of the latest functionality and improvements in AWS Glue and in the upgraded versions of the supported engines such as Apache Spark. For example, AWS Glue 4.0 includes the new optimized Apache Spark 3.3.0 runtime and adds support for built-in pandas APIs as well as native support for Apache Hudi, Apache Iceberg, and more.
    Auto scaling- To avoid over-provisioning workers, use AWS Glue auto scaling, which dynamically scales resources up and down based on the workload, for both batch and streaming jobs. Auto scaling removes the need to fine-tune the number of workers and avoids paying for idle workers.

    To enable auto scaling on AWS Glue Studio, go to the Job Details tab of your AWS Glue job and select Automatically scale number of workers.

(Image: enabling auto scaling in AWS Glue Studio)
    Flex- For non-urgent (non-production) data integration workloads that don't require fast job start times, or that can afford to rerun in case of a failure, Flex can be a good option. The start times and runtimes of jobs using Flex vary because spare compute resources aren't always available instantly and may be reclaimed during a run.

    To enable Flex on AWS Glue Studio, go to the Job Details tab of your job and select Flex execution, and set the job's timeout period appropriately; a combined boto3 sketch follows.
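A rough boto3 sketch of a job definition that combines these settings (the job name, role, script location, worker counts, and timeout are illustrative assumptions, and the `--enable-auto-scaling` job parameter is my reading of the Glue documentation, so verify it for your Glue version):

```python
# Define a Glue 4.0 job with auto scaling and Flex execution (names, role,
# script path, and worker counts are illustrative assumptions).
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nightly-curation-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-data-lake/scripts/curate_events.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",      # latest engine: Spark 3.3.0, pandas APIs, Hudi/Iceberg
    WorkerType="G.1X",
    NumberOfWorkers=10,     # upper bound; auto scaling works downward from here
    ExecutionClass="FLEX",  # spare-capacity execution for non-urgent jobs
    Timeout=120,            # minutes; set an appropriate timeout for Flex jobs
    DefaultArguments={
        "--enable-auto-scaling": "true",  # assumption: job parameter per Glue docs
    },
)
```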


  • Choose serverless services
    Serverless services are fully managed and incur costs only when you use them, eliminating the operational cost of managing the service. For example, if you want to run a job that processes your data once a day, a serverless service incurs a cost only when the job runs, whereas a self-managed service incurs a cost for the provisioned instances that host it. Running a Spark job in AWS Glue to process a CSV file stored in Amazon S3 might take 10 minutes on 6 DPUs and cost $0.44 (6 DPUs × 10/60 hour × $0.44 per DPU-hour). To execute the same job on Amazon EMR, you need at least three m5.xlarge instances (for high availability) running at a rate of $0.240 per hour each. A back-of-the-envelope comparison follows.
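As a back-of-the-envelope sketch using the prices quoted above (actual prices vary by Region and over time):

```python
# Back-of-the-envelope comparison of a per-run Glue cost vs. keeping a minimal
# EMR cluster up all day (prices taken from the figures quoted above).
GLUE_DPU_HOUR = 0.44                            # USD per DPU-hour
glue_cost = 6 * (10 / 60) * GLUE_DPU_HOUR       # 6 DPUs for 10 minutes
print(f"Glue job run: ${glue_cost:.2f}")        # -> $0.44

EMR_M5_XLARGE_HOUR = 0.240                      # USD per instance-hour (EC2 + EMR fee)
emr_cost = 3 * 24 * EMR_M5_XLARGE_HOUR          # 3 nodes running all day
print(f"Always-on EMR cluster: ${emr_cost:.2f}/day")  # -> $17.28/day
```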

Effectively managing costs in an AWS data lake requires a combination of strategic planning, continuous monitoring, and the implementation of cost optimization techniques. By following the strategies outlined in this comprehensive guide, organizations can achieve a balance between cost efficiency and data lake performance, ultimately driving value from their data assets. Remember, cost optimization is an ongoing process that evolves alongside your data lake infrastructure and usage patterns.
