<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashish Panda</title>
    <description>The latest articles on DEV Community by Ashish Panda (@mr-ashishpanda).</description>
    <link>https://dev.to/mr-ashishpanda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1742041%2Fa2d8595d-0370-41b3-9100-f57664a56016.jpg</url>
      <title>DEV Community: Ashish Panda</title>
      <link>https://dev.to/mr-ashishpanda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mr-ashishpanda"/>
    <language>en</language>
    <item>
      <title>Reducing Data Lake Costs on AWS by 80%: A High-Level Strategy Guide</title>
      <dc:creator>Ashish Panda</dc:creator>
      <pubDate>Mon, 30 Jun 2025 17:07:46 +0000</pubDate>
      <link>https://dev.to/mr-ashishpanda/reducing-data-lake-costs-on-aws-by-80-a-high-level-strategy-guide-23cn</link>
      <guid>https://dev.to/mr-ashishpanda/reducing-data-lake-costs-on-aws-by-80-a-high-level-strategy-guide-23cn</guid>
      <description>&lt;p&gt;At DNIF Hypercloud, a cybersecurity company processing millions of security events per second, data is at the core of everything we do. Our workloads are incredibly data-intensive, which means managing our data lake infrastructure on AWS is crucial for both performance and cost efficiency. This blog post shares the key insights and strategies that enabled us to re-architect our data lake and achieve an impressive 80% reduction in our data lake’s AWS cost.&lt;/p&gt;

&lt;h1&gt;Key Insights into Cost Reduction&lt;/h1&gt;

&lt;p&gt;Achieving such significant cost savings wasn't about minor tweaks; it involved a fundamental shift in our architectural approach and operational practices. Here are the core insights that guided our transformation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deep Dive into Data Usage:&lt;/strong&gt; Understanding exactly how our data was being used, which analytics cases were most frequent, and the underlying raw data they consumed was paramount. This granular understanding allowed us to make informed decisions about data placement and access.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing for Performance and Cost:&lt;/strong&gt; We realized that reducing network latency and increasing compute efficiency could directly translate into lower costs. Less time spent waiting for data means less compute time consumed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embracing Elasticity:&lt;/strong&gt; Traditional fixed-capacity infrastructure often leads to over-provisioning and wasted resources. Shifting to an elastic, autoscaling model was key to aligning our infrastructure with actual demand.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leveraging Cloud-Native Features:&lt;/strong&gt; AWS offers a vast array of services and pricing models. Intelligent utilization of these features, like spot instances, can unlock substantial savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;High-Level Strategies for Cost Optimization&lt;/h1&gt;

&lt;p&gt;The following strategies were instrumental in realizing our 80% cost reduction. While the specific tools for our data lake storage and processing engine are not disclosed, the principles can be applied to various well-known tools in the market.&lt;/p&gt;

&lt;h2&gt;1. Reducing Network Latency Between Data Storage and Processing&lt;/h2&gt;

&lt;p&gt;We identified that a significant portion of our data analytics use cases frequently queried a specific subset of raw data. To address this, we implemented a caching layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We grouped our data analytics use cases and identified the frequently queried data patterns.
&lt;/li&gt;
&lt;li&gt;Based on these patterns, we cached several terabytes of this raw data closer to our data processing engine, utilizing a high-performance network file system local to a single Availability Zone (AZ).
&lt;/li&gt;
&lt;li&gt;Approximately 75% of our data analytics use cases were served by this local cache.
&lt;/li&gt;
&lt;li&gt;The remaining 25% of use cases continued to be powered by our multi-AZ, high-scale storage layer, which handles hundreds of terabytes of data.&lt;/li&gt;
&lt;/ul&gt;
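&lt;p&gt;The read-through pattern behind this caching layer can be sketched as follows. This is a minimal illustration rather than our actual implementation: &lt;code&gt;fetch_remote&lt;/code&gt; and &lt;code&gt;cache_root&lt;/code&gt; are hypothetical stand-ins for the object store client and the AZ-local file system mount.&lt;/p&gt;

```python
from pathlib import Path

def read_partition(key: str, fetch_remote, cache_root: Path) -> bytes:
    """Return one partition's bytes, preferring an AZ-local cache.

    `fetch_remote` is any callable that pulls the object from the
    central multi-AZ store (e.g. an S3 GetObject call) and is kept
    abstract here; `cache_root` would be the local NFS mount point.
    """
    local = cache_root / key
    if local.exists():
        return local.read_bytes()          # cache hit: low-latency local path
    data = fetch_remote(key)               # cache miss: go to central store
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(data)                # populate the cache for next reader
    return data
```

&lt;p&gt;The first read of a hot partition pulls from the central store and populates the cache; subsequent reads of the same data never leave the AZ, which is what drives the latency and compute-efficiency gains described above.&lt;/p&gt;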

&lt;p&gt;This strategic caching reduced latency, lowered the compute load on our central storage, and significantly increased the compute efficiency at our data processing layer due to fewer I/O waits. As a result, our overall analytics performance improved by 58%, directly contributing to an approximately 28% reduction in our overall data lake cost.&lt;/p&gt;

&lt;h2&gt;2. Shifting to Kubernetes for Data Processing&lt;/h2&gt;

&lt;p&gt;Our data processing engine previously ran on a traditional fixed-count, VM-based architecture. We transitioned it to pods on Kubernetes, with autoscaling powered by KEDA. This shift offered several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Capacity Adjustment:&lt;/strong&gt; Kubernetes allowed us to dynamically adjust our infrastructure and compute capacity based on actual load, eliminating the need for constant manual provisioning and de-provisioning.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced VM Overhead:&lt;/strong&gt; Moving to a pod-based architecture eliminated much of the management burden that comes with traditional VMs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Efficiency:&lt;/strong&gt; Our pods became more resource-efficient, consuming only the necessary compute resources for their tasks.&lt;/li&gt;
&lt;/ul&gt;
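&lt;p&gt;For intuition, the scaling rule that KEDA (via the Horizontal Pod Autoscaler) applies to an external metric such as queue depth can be sketched as below. This is an illustration of the idea, not KEDA's exact implementation, and the parameter values are placeholders.&lt;/p&gt;

```python
import math

def desired_replicas(metric_value: float, target_per_pod: float,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Scale out so each pod handles at most `target_per_pod` of the
    metric (e.g. pending events), clamped to the configured bounds."""
    wanted = math.ceil(metric_value / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))
```

&lt;p&gt;At zero load the deployment idles at the minimum replica count instead of holding a fixed fleet of VMs, which is where the savings come from.&lt;/p&gt;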

&lt;p&gt;This migration to Kubernetes and an autoscaling model resulted in an approximate 20% saving in our AWS costs.&lt;/p&gt;

&lt;h2&gt;3. Adopting a Mix of Spot and On-Demand Instances&lt;/h2&gt;

&lt;p&gt;To further optimize our compute costs, we made our data processing layer fault-tolerant through smart retries and robust error handling. This crucial groundwork allowed us to leverage AWS Spot Instances for our data processing engine running on Kubernetes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We implemented Karpenter, a flexible, high-performance Kubernetes cluster autoscaler, to dynamically provision the cheapest available Spot or On-Demand instances for our Kubernetes nodes based on a specified policy.
&lt;/li&gt;
&lt;li&gt;This approach allowed us to take advantage of the significant cost savings offered by Spot Instances while ensuring resilience through our application-level fault tolerance.&lt;/li&gt;
&lt;/ul&gt;
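&lt;p&gt;Application-level fault tolerance of the kind described above can be as simple as idempotent tasks plus retries with backoff. The sketch below shows the retry half of that idea; our actual error handling is more involved, and the delay parameters are arbitrary.&lt;/p&gt;

```python
import random
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Retry `fn` with exponential backoff and jitter. A task killed by
    a Spot reclaim simply fails once and is re-run on another node."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            # exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```

&lt;p&gt;With tasks made safe to re-run, Spot interruptions become routine retries rather than failures, which is what lets Karpenter chase the cheapest capacity.&lt;/p&gt;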

&lt;p&gt;By intelligently combining Spot and On-Demand instances, we achieved a further 33% saving on our overall data lake cost.&lt;/p&gt;
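&lt;p&gt;For intuition on how the three figures combine into the 80% headline: each saving is quoted as a share of the original bill, so they add rather than compound.&lt;/p&gt;

```python
# Each figure below is quoted against the ORIGINAL data lake bill,
# so the three strategies sum rather than compound.
savings = {"caching": 0.28, "kubernetes_autoscaling": 0.20, "spot_instances": 0.33}
total = sum(savings.values())
print(f"combined saving: {total:.0%}")   # 81%, in line with the ~80% headline
```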

&lt;h1&gt;The Impact: Beyond Cost Savings&lt;/h1&gt;

&lt;p&gt;The rearchitecture of our data lake at DNIF Hypercloud yielded benefits far beyond just cost reduction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved Operational Efficiency:&lt;/strong&gt; The automation and elasticity introduced by Kubernetes and dynamic instance provisioning streamlined our operations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Resource Utilization:&lt;/strong&gt; We are now consuming only the resources we need, when we need them, minimizing waste.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Performance:&lt;/strong&gt; The latency reduction and compute efficiency gains led to faster analytics performance, directly impacting our ability to respond to security threats.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enabling Further Innovation:&lt;/strong&gt; The significant cost savings have freed up budget, allowing us to invest in further innovation and development within our cybersecurity platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Our journey at DNIF Hypercloud demonstrates that substantial cost reductions in AWS-hosted data lakes are achievable through strategic rearchitecture. By focusing on reducing network latency, embracing containerized, autoscaling approaches with Kubernetes, and intelligently leveraging AWS pricing models like Spot Instances, we were able to reduce our data lake costs by 80%. This transformation not only delivered significant financial savings but also improved the performance and efficiency of our critical data analytics capabilities. We encourage other data-intensive organizations to explore similar strategies to optimize their cloud data infrastructure.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datalake</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Avoid AWS Billing Surprises: Simplify Cloud FinOps</title>
      <dc:creator>Ashish Panda</dc:creator>
      <pubDate>Sat, 14 Sep 2024 23:43:35 +0000</pubDate>
      <link>https://dev.to/mr-ashishpanda/avoid-aws-billing-surprises-simplify-cloud-finops-4jji</link>
      <guid>https://dev.to/mr-ashishpanda/avoid-aws-billing-surprises-simplify-cloud-finops-4jji</guid>
      <description>&lt;p&gt;A few months ago, in our organisation, we encountered a situation where we saw an unexpected spike in our AWS bill at the end of the month. This prompted a thorough investigation, and we realized that we hadn't accounted for a particular scenario at production scale, leading to an unexpected overspend. This experience initiated a mini-project within the organization focused on implementing alerting measures to avoid such billing surprises. I’m excited to say that the small mistake we made earlier has now helped us avoid a 3x monthly bill  due to an application misconfiguration! I’d like to share some really simple yet powerful approaches to saving your organization from cost surprises.&lt;/p&gt;

&lt;p&gt;AWS offers powerful built-in tools and mechanisms for cloud cost management, making it easier to handle Cloud FinOps. Here are the top 3 ways to go about it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure daily billing alerts&lt;/li&gt;
&lt;li&gt;Set up cost anomaly detection&lt;/li&gt;
&lt;li&gt;Use Cost Explorer for root cause detection&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Configure daily billing alerts&lt;/h3&gt;

&lt;p&gt;Alerts help you respond to events more quickly, and there are two ways to create billing alerts in AWS:&lt;br&gt;
a. &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/monitor_estimated_charges_with_cloudwatch.html" rel="noopener noreferrer"&gt;Create a CloudWatch billing alarm&lt;/a&gt;&lt;br&gt;
b. &lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/create-cost-budget.html" rel="noopener noreferrer"&gt;Create an AWS cost budget&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use both depending on your needs, though the cost budget approach is my personal favorite for its ease of configuration and rich features. Create a budget with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Period: Daily&lt;/li&gt;
&lt;li&gt;Budget amount: Your expected daily spending&lt;/li&gt;
&lt;li&gt;Scope: 'All AWS Services'&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configure three alert thresholds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At a low threshold (e.g., 80%) – This ensures you receive the previous day's spending email every day, regardless of whether the budget is exhausted.&lt;/li&gt;
&lt;li&gt;At 100% – To alert you when your budget is hit.&lt;/li&gt;
&lt;li&gt;At a higher threshold (e.g., 125%) – To notify you when your hard limit is reached, signaling that it’s time to take action.&lt;/li&gt;
&lt;/ol&gt;
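&lt;p&gt;As a sketch, the budget and its three thresholds map to AWS Budgets payloads like the ones below. The dict shapes follow the AWS Budgets API, but the budget name and email address are placeholders, and the actual boto3 call is omitted.&lt;/p&gt;

```python
def daily_budget_with_alerts(daily_limit_usd: float, email: str):
    """Build a daily cost budget plus the three suggested alert
    thresholds, in the shapes the AWS Budgets API expects."""
    budget = {
        "BudgetName": "daily-all-services",   # arbitrary placeholder name
        "BudgetType": "COST",
        "TimeUnit": "DAILY",
        "BudgetLimit": {"Amount": str(daily_limit_usd), "Unit": "USD"},
    }
    notifications = [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": pct,              # percentage of the daily budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for pct in (80, 100, 125)              # the three thresholds above
    ]
    return budget, notifications
```

&lt;p&gt;These would be passed to boto3's &lt;code&gt;budgets.create_budget&lt;/code&gt; as the &lt;code&gt;Budget&lt;/code&gt; and &lt;code&gt;NotificationsWithSubscribers&lt;/code&gt; arguments.&lt;/p&gt;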

&lt;p&gt;Make sure to revise your budget and thresholds as your AWS spending evolves.&lt;/p&gt;

&lt;h3&gt;Set up cost anomaly detection&lt;/h3&gt;

&lt;p&gt;Using machine learning to detect unusual cost patterns across AWS services and accounts can make life easier when it comes to spotting and analyzing cost spikes. AWS offers a free tool (free as of the time of writing this blog) called Cost Anomaly Detection, which allows you to create cost monitors across various dimensions. I recommend setting up at least one monitor of type 'AWS Service' and another of type 'Linked Account' if your organization has multiple accounts. When you receive an alert, the Anomaly Details dashboard is a great starting point for root cause analysis.&lt;/p&gt;
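&lt;p&gt;For reference, a service-level monitor is a small payload. The shape below follows the Cost Explorer &lt;code&gt;CreateAnomalyMonitor&lt;/code&gt; API; the monitor name is a placeholder and the API call itself is omitted.&lt;/p&gt;

```python
# You would pass this dict as the AnomalyMonitor argument to boto3's
# ce.create_anomaly_monitor (Cost Explorer client).
service_monitor = {
    "MonitorName": "per-service-spend",   # arbitrary placeholder name
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE",        # one anomaly baseline per AWS service
}
```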

&lt;h3&gt;Use Cost Explorer for root cause detection&lt;/h3&gt;

&lt;p&gt;Cost Explorer is a powerful tool for understanding and analyzing cloud spend. It’s recommended to schedule weekly and monthly reviews to analyze cost behavior from the previous period. A solid understanding of how to use Cost Explorer is essential for anyone managing Cloud FinOps. Knowing how to apply dimensions and filters to handle various root cause analysis scenarios and analyzing trends to derive actionable steps are key skills. Being well-versed in Cost Explorer can make the Cloud FinOps journey smoother for organizations of any scale.&lt;/p&gt;




&lt;p&gt;Share your approach and suggestions for Cloud FinOps in the comments below. I'd love to try them out!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>howto</category>
      <category>cloudfinops</category>
      <category>costsaving</category>
    </item>
    <item>
      <title>My 5-Year Journey on the Cloud</title>
      <dc:creator>Ashish Panda</dc:creator>
      <pubDate>Fri, 19 Jul 2024 20:15:58 +0000</pubDate>
      <link>https://dev.to/mr-ashishpanda/my-5-year-journey-on-the-cloud-27lp</link>
      <guid>https://dev.to/mr-ashishpanda/my-5-year-journey-on-the-cloud-27lp</guid>
      <description>&lt;p&gt;Sitting back on a Friday evening, reminiscing about my five-year adventure in the cloud world, fills me with nostalgia and excitement. Let me take you through this journey, skipping the part about how I landed here (that's a story for another day) and jumping straight to the professional milestones that began in September 2019.&lt;/p&gt;

&lt;h4&gt;Year 1: Diving into the Cloud&lt;/h4&gt;

&lt;p&gt;In September 2019, I officially took my first plunge into the cloud with a freelance project on Google Cloud Platform (GCP). My client wanted to host their WordPress-based CMS on the cloud, and I, armed with GCP's $300 credit for new customers, was more than eager to help. It was a win-win!&lt;/p&gt;

&lt;p&gt;That first year was a whirlwind. I landed four more clients, each bringing new challenges—static website hosting, three-tier web application hosting, deployment management, cloud finops, and small-scale networking and security tasks. These projects built my confidence and gave me a solid foundation in servers, networking, OS, and infrastructure concepts. The best part? Seeing my clients happy with the results.&lt;/p&gt;

&lt;h4&gt;Year 2: Discovering the Power of DevOps&lt;/h4&gt;

&lt;p&gt;The second year saw me diving deeper. I tackled two major client projects and a couple of smaller ones. This was when I unknowingly stepped into the world of DevOps, using Docker containers, CI/CD pipelines, and Git, without even knowing what DevOps meant! (More on this in my upcoming DevOps journey blog.)&lt;/p&gt;

&lt;p&gt;One client's project pushed me to explore AWS. With my GCP experience as a foundation, picking up AWS wasn't too challenging. Thanks to &lt;a href="https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c03/" rel="noopener noreferrer"&gt;Stephane Maarek's Udemy course&lt;/a&gt;, I quickly got up to speed with core AWS services. Juggling between GCP and AWS, I learned a ton and made good money. By the end of the year, I was wrapping up my freelance projects, ready to step into the corporate world.&lt;/p&gt;

&lt;h4&gt;Year 3: Corporate Adventures Begin&lt;/h4&gt;

&lt;p&gt;Joining one of India's top BFSI enterprises in my third year was a game-changer. The environment was buzzing with innovation, and the leadership was incredibly supportive. I got to witness and contribute to on-premises data center to cloud migrations and complex network and security architectures, all while navigating stringent compliance regulations.&lt;/p&gt;

&lt;p&gt;Working with a big cross-functional team, I sharpened my AWS skills and initiated some cool projects. Plus, I began delving into DevOps initiatives, setting the stage for future endeavors. The corporate world was challenging but filled with incredible learnings.&lt;/p&gt;

&lt;h4&gt;Year 4: Leading the Charge&lt;/h4&gt;

&lt;p&gt;Year four was all about driving change. I led cloud cost-saving initiatives and steered application and database modernization projects. These efforts aimed to make our architectures cloud-native, fault-tolerant, scalable, and cost-efficient.&lt;/p&gt;

&lt;p&gt;Mid-year, I got a thrilling opportunity: leading a disaster recovery project from scratch in another AWS region. This project was a massive learning experience. I spent nights poring over best practices, enterprise standards, AWS blogs, and Udemy courses. By day, I planned, hosted meetings, and navigated the enterprise's processes. After several iterations and expert consultations, we successfully built the landing zone, got core infra and security tools in place, cleared security audits, and performed DR drills for a few apps. This project was a game-changer, deepening my understanding of technology and processes from multiple angles.&lt;/p&gt;

&lt;h4&gt;Year 5: A New Beginning in Cybersecurity&lt;/h4&gt;

&lt;p&gt;The fifth year brought a major shift. I bid farewell to my previous organization and joined a cybersecurity company, bringing my AWS and DevOps expertise to the table. The visionary CEO and the exciting challenges ahead were too good to pass up.&lt;/p&gt;

&lt;p&gt;I was tasked with building a parallel cloud environment with the best AWS services and rearchitecting the product for Kubernetes, focusing on high scale, performance, and stability. My past experiences were invaluable in designing this solution from scratch. Overseeing cloud operations, managing app upgrades, security posture, vulnerabilities, monitoring controls, and stability—most of which we automated—was both challenging and exhilarating. As I write this, I’m thrilled by our progress and excited for what's next.&lt;/p&gt;




&lt;p&gt;And with that, I sign off. How would you describe your journey in the cloud? Share your thoughts in the comments — I’d love to hear your stories.&lt;br&gt;
Stay tuned for my upcoming post where I dive deeper into my DevOps journey!&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>career</category>
      <category>aws</category>
      <category>gcp</category>
    </item>
  </channel>
</rss>
