<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Orel Bello</title>
    <description>The latest articles on DEV Community by Orel Bello (@orelbello).</description>
    <link>https://dev.to/orelbello</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3033456%2F2d285d3b-63b7-4312-b2af-52de36dba934.jpeg</url>
      <title>DEV Community: Orel Bello</title>
      <link>https://dev.to/orelbello</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/orelbello"/>
    <language>en</language>
    <item>
      <title>The Hard Truth About Platform Engineering Adoption</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Mon, 23 Feb 2026 07:13:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-hard-truth-about-platform-engineering-adoption-3p47</link>
      <guid>https://dev.to/aws-builders/the-hard-truth-about-platform-engineering-adoption-3p47</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;You know how it is. There is always this ancient struggle between doing things the fast way or the right way. Most of the time, the right way is slower, but we still have to deliver, and fast.&lt;/p&gt;

&lt;p&gt;Platform Engineering adoption doesn't fail because of tooling. &lt;br&gt;
It fails because of habits.&lt;/p&gt;

&lt;p&gt;In a world where every request is critical and urgent (I'll never forget the developer who opened a ticket with a severity of "Production is down," saying that his personal AWS account didn't work, and it was Production for him), the struggle is real between fixing it manually because it's urgent, or investing time in building automation that will do it the right way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what would you do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're a small DevOps team, responsible for more than 200 engineers. Naturally, we started handling requests manually, tying up loose ends and eliminating blockers. &lt;br&gt;
After all, nobody wants to be the one slowing everyone down.&lt;/p&gt;

&lt;p&gt;But if you continue doing things manually just to keep up with urgent requests, in the long run you'll slow the entire company down. What works for a company of 50 engineers doesn't always scale to 200 engineers.&lt;/p&gt;

&lt;p&gt;That's where Platform Engineering comes into play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Platform Engineering Actually Means
&lt;/h2&gt;

&lt;p&gt;The Platform Engineering concept is pretty simple. Instead of handing developers a fish, we give them a fishing rod.&lt;/p&gt;

&lt;p&gt;If a traditional DevOps team builds infrastructure for developers, here we give them the tools to do it themselves.&lt;/p&gt;

&lt;p&gt;More importantly, Platform Engineering is about creating a default way of working. &lt;br&gt;
Not just tools that developers can use, but paths they're expected to use.&lt;/p&gt;

&lt;p&gt;The goal is to eliminate bottlenecks and accelerate innovation. If the DevOps team can't take a day off without the company going up in flames, something probably needs to change.&lt;/p&gt;

&lt;p&gt;We need to adopt an enablement mindset. How can we give developers tools to work independently, while still keeping the organization's best practices?&lt;/p&gt;

&lt;p&gt;Remember, the wisdom is to find ways to allow, not to block. Although, as we will see later on, sometimes blocking is inevitable.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Steps
&lt;/h2&gt;

&lt;p&gt;The first thing we did was build a self-service platform, which we call the Buffet.&lt;/p&gt;

&lt;p&gt;Now, whenever developers need to create a new MySQL user, MongoDB cluster, or even a secret (and many more different resources), they can do it completely automatically, without waiting for DevOps, by using the Buffet (which we implemented as a Slack bot).&lt;/p&gt;
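
&lt;p&gt;To make this concrete, here's a minimal sketch of how a Buffet-style request could be routed. The supported resource types and handler names are hypothetical, not the actual bot's implementation:&lt;/p&gt;

```python
# Hypothetical routing for a self-service Slack bot: validate the request,
# then hand it off to the matching automation handler.
SUPPORTED_RESOURCES = {
    "mysql_user": "create_mysql_user",
    "mongodb_cluster": "create_mongodb_cluster",
    "secret": "create_secret",
}

def route_request(resource_type, payload):
    """Validate a self-service request and return the handler to invoke."""
    if resource_type not in SUPPORTED_RESOURCES:
        raise ValueError(f"Unsupported resource type: {resource_type}")
    if not payload.get("requester"):
        raise ValueError("Every request must carry a requester for auditing")
    return SUPPORTED_RESOURCES[resource_type]
```

&lt;p&gt;The important part is the shape: every request is validated and audited the same way, no matter which resource it creates.&lt;/p&gt;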

&lt;p&gt;It's a win-win. Developers move much faster without waiting for DevOps, and DevOps has more capacity to focus on real work instead of manual support and acting as a help desk.&lt;/p&gt;

&lt;p&gt;But is that all? No.&lt;/p&gt;

&lt;p&gt;Self-service solves symptoms. It doesn't solve standardization.&lt;br&gt;
Platform Engineering doesn't end here, and this is also where the real problems usually start.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenges We Didn't Expect
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Saying No
&lt;/h3&gt;

&lt;p&gt;Migrating to a self-service portal instead of handling every request manually isn't always smooth. This is where DevOps needs to start saying "No."&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;"Can you please create me a MySQL user? It's urgent."&lt;br&gt;
"We're moving all MySQL users to the self-service platform. If we keep doing this manually, we'll never finish building it. You'll need to use the Buffet."&lt;/p&gt;

&lt;p&gt;This is usually the point where developers start feeling that DevOps is blocking them instead of helping.&lt;br&gt;
But in order to invest in innovation, you need to stop doing things manually, even if it means developers will temporarily have to wait.&lt;/p&gt;

&lt;p&gt;Short-term friction. Long-term acceleration.&lt;/p&gt;

&lt;p&gt;It's not always pleasant, but it's necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of Standardization
&lt;/h3&gt;

&lt;p&gt;That was just the tip of the iceberg. What came next hit us harder than expected.&lt;br&gt;
The root of almost all our headaches was lack of standardization.&lt;br&gt;
Each team did things their own way, which made organization-wide improvements painful. That included:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
Not all services followed organization or AWS best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;br&gt;
Teams wrote CI/CD pipelines differently, ran different pre-deploy checks, and deployed to AWS in their own ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;br&gt;
Alarms, custom metrics, and dashboards varied widely. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt;&lt;br&gt;
Access was inconsistent and sometimes overly permissive.&lt;/p&gt;

&lt;p&gt;And that was just the beginning.&lt;/p&gt;

&lt;p&gt;Every change felt risky, manual, and error-prone.&lt;/p&gt;

&lt;p&gt;The real kicker was that making an organization-wide change required touching every repository individually. This created friction, extra manual work, and a high risk of mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unicorn Startup Pressure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Doing all this while scaling a unicorn startup in a race for acquisition added even more pressure.&lt;/p&gt;

&lt;p&gt;Legacy services, tight deadlines, and a high-growth environment made the transition especially tricky. There was no clean slate. &lt;/p&gt;

&lt;p&gt;Everything had to keep working while we improved it.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Did We Do?
&lt;/h2&gt;

&lt;p&gt;We tackled the most impactful challenges first.&lt;br&gt;
Creating a self-service platform immediately eliminated a lot of manual work.&lt;/p&gt;

&lt;p&gt;But as good as the self-service platform was, it handled only one aspect of Platform Engineering. Our biggest challenge remained lack of standardization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Golden Path
&lt;/h2&gt;

&lt;p&gt;We started creating a Golden Path, our organization's right way of doing things.&lt;/p&gt;

&lt;p&gt;Our backend-platform team built the MDK (Melio Development Kit) - an internal opinionated CLI that generates and enforces AWS SAM service templates. Similar in spirit to CDK, it helps developers create a standardized template.yaml, which we use to deploy our services.&lt;br&gt;
It wasn't only about building templates faster, although writing SAM templates manually is never fun.&lt;/p&gt;

&lt;p&gt;More importantly, it finally allowed us to define how a service should look.&lt;/p&gt;

&lt;p&gt;Let that sink in for a moment. It's a big deal.&lt;/p&gt;

&lt;p&gt;With the MDK, we unlocked many opportunities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set best practices for AWS architecture, security, logging, tagging, and FinOps&lt;/li&gt;
&lt;li&gt;No more wide IAM permissions&lt;/li&gt;
&lt;li&gt;No more SQS without DLQs&lt;/li&gt;
&lt;li&gt;No more using highly expensive resources without a reason&lt;/li&gt;
&lt;/ul&gt;
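
&lt;p&gt;Checks like these can run at template-generation or CI time. Here's a small sketch of what such guardrails might look like over a parsed template; the rules shown are illustrative, not the MDK's real code:&lt;/p&gt;

```python
# Illustrative guardrails over a parsed SAM/CloudFormation template dict:
# flag SQS queues with no DLQ and IAM policies with wildcard actions.
def find_violations(template):
    """Return a list of guardrail violations found in a template dict."""
    violations = []
    for name, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        rtype = resource.get("Type")
        # No SQS queue without a dead-letter queue configured.
        if rtype == "AWS::SQS::Queue" and "RedrivePolicy" not in props:
            violations.append(f"{name}: SQS queue has no DLQ")
        # No wildcard IAM actions.
        if rtype == "AWS::IAM::Policy":
            doc = props.get("PolicyDocument", {})
            for stmt in doc.get("Statement", []):
                actions = stmt.get("Action", [])
                if isinstance(actions, str):
                    actions = [actions]
                if "*" in actions:
                    violations.append(f"{name}: wildcard IAM action")
    return violations
```

&lt;p&gt;Because every service is generated from the same templates, one checker like this covers the whole organization.&lt;/p&gt;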

&lt;p&gt;From the developer side, this also meant less guesswork and fewer decisions when starting a new service.&lt;/p&gt;

&lt;p&gt;And the best thing?&lt;/p&gt;

&lt;p&gt;Want to add a new feature across the organization? No more opening pull requests on hundreds of repositories, each one looking different.&lt;/p&gt;

&lt;p&gt;Just open one pull request.&lt;/p&gt;

&lt;p&gt;Or so we thought.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption Challenges
&lt;/h2&gt;

&lt;p&gt;Developers were comfortable with how they used to work and weren't excited about migrating their services to the MDK.&lt;/p&gt;

&lt;p&gt;The naive solution was setting this as a cross-organization initiative, prioritizing it with product managers, and working closely with R&amp;amp;D.&lt;/p&gt;

&lt;p&gt;In practice, enforcement became necessary.&lt;/p&gt;

&lt;p&gt;The harsh truth is that you can only get so far by asking nicely. To really move forward, you have to define guidelines and enforce them.&lt;/p&gt;

&lt;p&gt;Enforcement can happen at multiple levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure guardrails (for example, AWS Service Control Policies)&lt;/li&gt;
&lt;li&gt;Deployment blocking for non-compliant services&lt;/li&gt;
&lt;li&gt;Clear deprecation timelines for old versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: announce that developers have one month to start using the MDK. Other deployment methods will be deprecated and blocked. Teams that don't migrate won't be able to deploy new versions.&lt;/p&gt;

&lt;p&gt;It sounds aggressive, but this is often the only way to make progress at scale.&lt;/p&gt;

&lt;p&gt;The same applies to versions. If developers keep using an old version of the MDK, new features won't help. Deprecation and enforced upgrades are necessary.&lt;/p&gt;
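
&lt;p&gt;A deployment gate for deprecated versions can be as simple as a version check in CI. A sketch, with a hypothetical version scheme and cutoff:&lt;/p&gt;

```python
# Hypothetical CI gate: refuse to deploy services pinned to an MDK version
# older than the organization-wide minimum.
MINIMUM_MDK_VERSION = (2, 0, 0)

def parse_version(version):
    """Turn a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def may_deploy(service_mdk_version):
    """Block deployments from services on a deprecated MDK version."""
    return parse_version(service_mdk_version) >= MINIMUM_MDK_VERSION
```

&lt;p&gt;Pair a gate like this with a clearly announced deadline, and upgrades stop being optional without requiring anyone to chase teams manually.&lt;/p&gt;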

&lt;h2&gt;
  
  
  Where We Are Now
&lt;/h2&gt;

&lt;p&gt;Today, with both the MDK and the Buffet, which we continuously improve, we're on the right track. But there is still a long way to go.&lt;/p&gt;

&lt;p&gt;One of the clearest examples of why standardization matters is tagging.&lt;/p&gt;

&lt;p&gt;It may look insignificant, but tagging unlocks many capabilities. From ABAC-based permissions, which are critical for least privilege access, to cost allocation per team, and easily finding owners of budget-eating services, tagging is the foundation for everything.&lt;/p&gt;

&lt;p&gt;When moving to a Platform Engineering approach, we always need to operate on two paths in parallel.&lt;/p&gt;

&lt;p&gt;We must define guidelines and enforce them. For example, all new services must include predefined tags, and non-compliant deployments should be blocked (AWS Config can help here).&lt;/p&gt;
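
&lt;p&gt;A custom AWS Config rule essentially boils down to a compliance evaluation like the following sketch. The required tag keys here are examples, not our actual tagging policy:&lt;/p&gt;

```python
# Config-rule-style tag compliance check, sketched in plain Python.
# The required keys are illustrative examples.
REQUIRED_TAGS = {"team", "service", "env", "cost-center"}

def evaluate_tags(resource_tags):
    """Return a verdict plus any missing tag keys, like a custom Config rule."""
    missing = REQUIRED_TAGS - set(resource_tags)
    if missing:
        return "NON_COMPLIANT", sorted(missing)
    return "COMPLIANT", []
```

&lt;p&gt;The same evaluation can run in CI before deployment and in AWS Config after it, so drift gets caught on both paths.&lt;/p&gt;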

&lt;p&gt;At the same time, we must migrate existing services, which often takes much longer.&lt;/p&gt;

&lt;p&gt;The same principle applies to CI/CD, monitoring and observability, and yes, also AI.&lt;/p&gt;

&lt;p&gt;We live in a world with constant AI FOMO. Adopting everything immediately leads to fragmentation, which is the enemy of Platform Engineering.&lt;/p&gt;

&lt;p&gt;We need to choose the right tools, define guidelines, and invest in proper rollout with training sessions, tutorials, documentation, and even hackathons.&lt;/p&gt;

&lt;p&gt;Just like DevOps, Platform Engineering is a mindset and a methodology. It should be reflected everywhere.&lt;br&gt;
Security, AI, FinOps, CI/CD, monitoring. The same approach applies to all of them.&lt;/p&gt;

&lt;p&gt;Enable, don't block. But make the right way the easiest way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned and Conclusion
&lt;/h2&gt;

&lt;p&gt;Platform Engineering is never done. It's an ongoing journey.&lt;br&gt;
The most important rule is standardization. Define guidelines and enforce them. That's the key.&lt;/p&gt;

&lt;p&gt;Don't do the work for developers. Give them the tools to do it themselves.&lt;/p&gt;

&lt;p&gt;Always prioritize building self-service tools over manual, repetitive work, no matter how urgent it feels, unless it's truly P0.&lt;/p&gt;

&lt;p&gt;Remember that the self-service platform is only part of the story. As big as it is, it's not the whole picture.&lt;/p&gt;

&lt;p&gt;Start as early as you can. It will have a massive impact later, when you need to support many legacy services that don't follow organizational guidelines.&lt;/p&gt;

&lt;p&gt;The beginning will be hard, but it pays off. Platform Engineering boosts productivity, even if at first it slows you down.&lt;/p&gt;

&lt;p&gt;If DevOps scales people, Platform Engineering scales standards.&lt;/p&gt;

&lt;p&gt;And just as important, explain to developers why you're doing this and how it benefits them. You'll need their cooperation.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>platformengineering</category>
      <category>awscommunitybuilders</category>
      <category>fintech</category>
    </item>
    <item>
      <title>The Problem With Cross-Account DB Access (And How Data API Solved It)</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Fri, 09 Jan 2026 09:34:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-problem-with-cross-account-db-access-and-how-data-api-solved-it-4feh</link>
      <guid>https://dev.to/aws-builders/the-problem-with-cross-account-db-access-and-how-data-api-solved-it-4feh</guid>
      <description>&lt;h2&gt;
  
  
  So what is Data API and how can it help you?
&lt;/h2&gt;

&lt;p&gt;You know how it goes.&lt;br&gt;
You get a support ticket from a developer: “Please create a DB user for my service.”&lt;br&gt;
Or worse: “I need to query the DB, can you create me a personal user?”&lt;/p&gt;

&lt;p&gt;It’s not the most exciting task a DevOps engineer can get, but it’s critical for keeping the business running. And when this process does not scale, DevOps quickly becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;So we built a self-service mechanism. A developer can simply use a Slack bot to request a database user, without involving DevOps.&lt;br&gt;
Win for the developer, win for DevOps.&lt;/p&gt;

&lt;p&gt;Under the hood, the architecture was actually pretty solid.&lt;br&gt;
We stored the desired users list in an S3 bucket.&lt;br&gt;
Each change triggered an event notification to a central SNS topic, which kicked off a Step Function in every AWS account.&lt;br&gt;
From there, Lambda functions handled the heavy lifting: creating the DB user with the right permissions (read-only, read-write, admin), generating the secret, and even registering it in the database proxy.&lt;/p&gt;

&lt;p&gt;On paper, this was robust. In reality, distributed systems are never perfect.&lt;/p&gt;

&lt;p&gt;Sometimes an event was delayed. Sometimes a Lambda failed. Sometimes permissions drifted. And occasionally, a user was simply not created.&lt;/p&gt;

&lt;p&gt;The real problem was not the failure itself.&lt;br&gt;
It was visibility.&lt;/p&gt;

&lt;p&gt;From the developer’s perspective, they clicked a button in Slack and… nothing happened. No clear feedback, no way to validate the outcome.&lt;/p&gt;

&lt;p&gt;That is when we decided to build a &lt;strong&gt;Validator workflow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea sounded simple: create a Lambda function that checks whether the DB user exists and reports the result back to the user.&lt;/p&gt;

&lt;p&gt;But then reality hit again.&lt;/p&gt;

&lt;p&gt;We needed to validate users across &lt;strong&gt;multiple environments, multiple AWS accounts, and multiple databases&lt;/strong&gt;. And as we all know, connecting to a database usually means one thing: network access from inside the VPC.&lt;/p&gt;

&lt;p&gt;In a multi-account setup, that left us with two main options.&lt;/p&gt;

&lt;p&gt;The first option was VPC peering. Technically possible, but not something we were comfortable with. We did not want to expose production VPCs to other environments just for validation purposes.&lt;/p&gt;

&lt;p&gt;The second option was to deploy a Lambda function in every account and trigger it cross-account. That worked, but now we were managing dozens of Lambdas, IAM roles, permissions, and invocation logic. The validator itself was becoming another distributed system.&lt;/p&gt;

&lt;p&gt;At that point, we stepped back and asked a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if there was a way to query all of our databases from a central account, without VPC peering and without deploying Lambda functions everywhere?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Turns out, there is.&lt;br&gt;
And it is called &lt;strong&gt;Data API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7575y27aqlgvvm38z2b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7575y27aqlgvvm38z2b8.png" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data API is a feature of Amazon Aurora that allows you to interact with a database using API calls, without requiring direct network connectivity to the database VPC. No security groups, no subnets, no peering. Just IAM-authenticated API calls.&lt;/p&gt;

&lt;p&gt;This changes the game.&lt;/p&gt;

&lt;p&gt;Instead of running validation logic inside every VPC, we could run a single Lambda in a central account and query each database directly using Data API. Same code, same workflow, no networking complexity.&lt;/p&gt;

&lt;p&gt;Under the hood, Data API is also what powers the RDS Query Editor in the AWS console. When you run a query from the UI, you are already using it, whether you realize it or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-account access (the part that actually matters)&lt;/strong&gt;&lt;br&gt;
Because the validator runs in a central account, we still needed a secure way to access databases that live in other environments.&lt;/p&gt;

&lt;p&gt;Data API removes the networking requirement, &lt;strong&gt;but IAM is still enforced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, this meant creating a &lt;strong&gt;cross-account IAM role&lt;/strong&gt; in each environment. The central account is allowed to assume this role, and the role itself has permissions to call the Data API on the local Aurora cluster.&lt;/p&gt;

&lt;p&gt;We deployed this role to all environments using &lt;strong&gt;Terraform&lt;/strong&gt;, so every account followed the same trust policy and permission boundaries. No manual setup, no snowflakes.&lt;/p&gt;

&lt;p&gt;From the validator’s perspective, the flow is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume the role in the target account&lt;/li&gt;
&lt;li&gt;Call the Data API&lt;/li&gt;
&lt;li&gt;Run the validation query&lt;/li&gt;
&lt;li&gt;Return the result to the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data API solves the networking problem.&lt;br&gt;
Cross-account IAM roles solve the permissions problem.&lt;/p&gt;
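
&lt;p&gt;In boto3 terms, the flow looks roughly like this. The role session name and the SQL are illustrative, and the real validator differs in detail:&lt;/p&gt;

```python
# Sketch of the validator: assume a cross-account role, then query Aurora
# through the Data API with the temporary credentials.
def build_validation_sql():
    """MySQL-flavored check for whether a named DB user exists."""
    return "SELECT COUNT(*) FROM mysql.user WHERE user = :username"

def validate_db_user(role_arn, cluster_arn, secret_arn, username):
    import boto3  # imported lazily so the pure helper stays testable offline

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="db-user-validator"
    )["Credentials"]

    rds_data = boto3.client(
        "rds-data",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    response = rds_data.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        sql=build_validation_sql(),
        parameters=[{"name": "username", "value": {"stringValue": username}}],
    )
    # Data API returns rows as lists of typed fields.
    count = response["records"][0][0]["longValue"]
    return count > 0
```

&lt;p&gt;Note that the Data API call needs no VPC configuration at all; the function running this code only needs permission to assume the cross-account role.&lt;/p&gt;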

&lt;p&gt;Together, they let us centralize access without compromising security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Data API actually makes sense&lt;/strong&gt;&lt;br&gt;
Data API is not a silver bullet. It has throughput limits, different latency characteristics, and it only works with Aurora. You would not use it for high-volume application traffic.&lt;/p&gt;

&lt;p&gt;But for &lt;strong&gt;control-plane operations&lt;/strong&gt; like validation, auditing, administrative workflows, and platform automation, it is extremely powerful.&lt;/p&gt;

&lt;p&gt;In our case, Data API allowed us to reduce infrastructure complexity, standardize access across environments, and give developers fast, reliable feedback without pulling DevOps into every request.&lt;/p&gt;

&lt;p&gt;Sometimes the best solution is not adding more infrastructure, but removing it.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>rds</category>
      <category>aurora</category>
    </item>
    <item>
      <title>From Bare Metal to Serverless: How to Evolve Your Disaster Recovery Strategy</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Wed, 23 Jul 2025 12:13:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/from-bare-metal-to-serverless-how-to-evolve-your-disaster-recovery-strategy-1h17</link>
      <guid>https://dev.to/aws-builders/from-bare-metal-to-serverless-how-to-evolve-your-disaster-recovery-strategy-1h17</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Imagine this scenario:&lt;/strong&gt;&lt;br&gt;
You’re working at a successful, even profitable company, using the latest cutting-edge technologies out there, and feeling good. Things can’t get any better than this.&lt;/p&gt;

&lt;p&gt;But one day, you wake up in the morning to 10+ missed calls and dozens of messages that yell “Production is Down!!!”.&lt;/p&gt;

&lt;p&gt;You find out that a disaster has occurred (your data center caught fire, there was a regional power outage, you name it!), and your entire system is down.&lt;/p&gt;

&lt;p&gt;You are probably thinking by now, “Those are on-prem problems, I’m using AWS — I have nothing to worry about!”&lt;/p&gt;

&lt;p&gt;But what happens if a hacker encrypts your entire environment? Or maybe LLM-generated code accidentally deleted the Production database?&lt;br&gt;
Or, a lighter option: an entire AWS region is down?&lt;/p&gt;

&lt;p&gt;What can you do?&lt;/p&gt;

&lt;p&gt;That’s where Disaster Recovery comes into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Me&lt;/strong&gt;&lt;br&gt;
I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer with over 4 years of experience, including the past 3 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architecture, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;One thing you should know about Melio is that our entire architecture is fully serverless. We run a large-scale environment of Lambda functions, and naturally, Lambda has become our go-to solution for nearly every challenge we need to address.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Disaster Recovery?
&lt;/h2&gt;

&lt;p&gt;Disaster recovery (or DR for short) is what it sounds like: recovering from a disaster. There are a lot of use cases that fall into this category.&lt;/p&gt;

&lt;p&gt;The bottom line is that you need to define a workflow to get your application up and running when your main site is down.&lt;/p&gt;

&lt;p&gt;Your Disaster Recovery Plan (DRP) shouldn’t be a separate initiative, it needs to be fully integrated with your architecture and application logic.&lt;/p&gt;

&lt;p&gt;Before you start building your DRP, you need to decide on your desired recovery time objective (RTO) and recovery point objective (RPO).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTO and RPO&lt;/strong&gt;&lt;br&gt;
RTO and RPO are the main components when designing a DRP; together, they determine your DR strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let me explain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTO&lt;/strong&gt; is the maximum downtime you can tolerate. How long will your system be down?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO&lt;/strong&gt; is the amount of data loss you are willing to endure. What can you afford to lose?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lower RTO and RPO mean less downtime and less data loss, but also (probably) a more expensive DRP.&lt;/p&gt;

&lt;p&gt;(Many aspects can affect our decision to choose our desired RPO and RTO, like KPIs, SLAs, or our commitment to our clients and partners).&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
DRP with RTO of 5 hours and RPO of 15 minutes means that we will have a data loss of up to 15 minutes (for example, by taking a scheduled snapshot every 15 minutes), and it will take up to 5 hours to get our system up and running again.&lt;/p&gt;
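
&lt;p&gt;The arithmetic is simple enough to express directly. A tiny sketch, using the example numbers above:&lt;/p&gt;

```python
# With scheduled snapshots, worst-case data loss equals the snapshot interval:
# the disaster hits just before the next snapshot would have run.
def worst_case_rpo_minutes(snapshot_interval_minutes):
    return snapshot_interval_minutes

def within_objectives(recovery_minutes, data_loss_minutes,
                      rto_minutes=300, rpo_minutes=15):
    """Did a recovery meet the example objectives (RTO 5h, RPO 15m)?"""
    return rto_minutes >= recovery_minutes and rpo_minutes >= data_loss_minutes
```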

&lt;h2&gt;
  
  
  Rollback or cutover?
&lt;/h2&gt;

&lt;p&gt;One more thing that you need to consider when designing a DRP is Rollback or Cutover.&lt;/p&gt;

&lt;p&gt;Let’s say we’re designing a DRP for an entire regional outage on AWS, we’ve initiated a failover to our backup region, and what happens when our main region is back online?&lt;/p&gt;

&lt;p&gt;Should we go back to our main region, or stay in the new one?&lt;/p&gt;

&lt;p&gt;If we’re dealing with a hacker who encrypted our entire region, we may not have a main region to go back to.&lt;/p&gt;

&lt;p&gt;So it’s really important to ask ourselves those questions before defining our DRP; the answer to those questions will determine our strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does DR work in an on-prem situation?
&lt;/h2&gt;

&lt;p&gt;OK, now we know what a DR is a little better, but before we jump into DR on cloud, let’s get back to the basics.&lt;/p&gt;

&lt;p&gt;How does good, old-fashioned DR work on-prem?&lt;/p&gt;

&lt;p&gt;On AWS, we can spin up a new DR environment with just a few clicks (yes, I know I’m exaggerating).&lt;/p&gt;

&lt;p&gt;But on-prem, that’s a whole different story.&lt;/p&gt;

&lt;p&gt;We have to plan ahead and run our entire workload accordingly.&lt;/p&gt;

&lt;p&gt;What does this mean? Let’s start with a real-life example.&lt;/p&gt;

&lt;p&gt;A while ago, when I served in the Israel Police, we needed a DR for the Israeli 911 emergency center, and the cloud wasn’t an option.&lt;/p&gt;

&lt;p&gt;So we needed to build a new emergency center from scratch, in a different, physically isolated place, with all the required equipment (computers, phones, communication devices, you name it!). It may seem like a use case that has nothing to do with cloud DR, but the basic principles are the same when you come to design a DRP for the cloud.&lt;/p&gt;

&lt;p&gt;I wanted to understand how DR behaves on-prem, so I met with a Director of Storage Architecture to shed some light on it.&lt;/p&gt;

&lt;p&gt;When designing a DRP on-prem, we have two main methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first is to build the DR site within 300 meters of the main one, connected with FC (Fibre Channel) cables.&lt;/li&gt;
&lt;li&gt;The second supports distances of up to about 10 km, with a single cable running between the sites.
(Some solutions support extended distances beyond 10 km, but with higher complexity and various downsides; that’s out of scope for this blog post.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When talking about DR on-prem, we also need to choose:&lt;/strong&gt;&lt;br&gt;
Do we want a failover DR, which will be activated only when a disaster occurs?&lt;/p&gt;

&lt;p&gt;Or do we want to utilize our DRP to be fast and resilient, and run on an active-active architecture?&lt;/p&gt;

&lt;p&gt;Active-active means that we have both our main site and backup site working at the same time!&lt;/p&gt;

&lt;p&gt;When using active-active, we want each site to be able to absorb all the traffic routed from the failed site when a disaster occurs. That means each site must run at only 50% of its capacity at any given time, so that when needed it can handle 100% (the failed site’s traffic plus its own, all at once). In normal operation, that’s a huge amount of underutilized resources!&lt;/p&gt;

&lt;p&gt;We have some serious trade-offs here, but is the cloud really any better?&lt;/p&gt;

&lt;h2&gt;
  
  
  Different strategies and approaches for DR
&lt;/h2&gt;

&lt;p&gt;So, when talking about an AWS DRP, we have a few strategies:&lt;/p&gt;

&lt;p&gt;Backup and restore, pilot light, warm standby, and multi-site, ranging from the lowest cost with the poorest RTO to the most expensive with the minimal RTO.&lt;/p&gt;

&lt;p&gt;Let’s break them down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backup and restore:&lt;/strong&gt;&lt;br&gt;
This is the most basic one and pretty straightforward:&lt;/p&gt;

&lt;p&gt;We take snapshots of our RDS at a fixed interval (and of our EC2 or container images, depending on which compute services we are using) and save them in our backup region.&lt;/p&gt;

&lt;p&gt;Pros: It’s the simplest and cheapest one.&lt;/p&gt;

&lt;p&gt;Cons: When we face a disaster, we will need to deploy all of our services from scratch and restore our RDS from a snapshot, which will result in a longer downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot light:&lt;/strong&gt;&lt;br&gt;
Similar to backup and restore, but here we keep our core functionality up and running on our backup region, so when we need to initialize a failover to the backup region, it will be faster.&lt;/p&gt;

&lt;p&gt;Of course, as we said before, we get better RTO and the price goes up as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm standby:&lt;/strong&gt;&lt;br&gt;
Here we don’t only have our core functionality up and running on our backup region, but our entire scaled-down system is running on our backup region.&lt;/p&gt;

&lt;p&gt;So when a disaster occurs, we just need to scale up our backup environment instead of deploying it from scratch!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-site:&lt;/strong&gt;&lt;br&gt;
Here we have an active-active architecture, we have both our main region and the backup region running our fully scaled-up workload!&lt;br&gt;
This method requires a different approach, and is harder to maintain; think about it, now you have twice the production to give you a headache!&lt;/p&gt;

&lt;p&gt;But you get the ultimate RTO and RPO! You’re always live, and your users won’t be able to tell the difference if your main site is down — you will just need to have double the budget.&lt;/p&gt;
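
&lt;p&gt;Picking between these strategies is mostly a function of your RTO target (and budget). A rough decision helper, with purely illustrative thresholds:&lt;/p&gt;

```python
# Map a target RTO to one of the four AWS DR strategies described above.
# The cutoffs are illustrative; real ones depend on your workload and budget.
def choose_dr_strategy(rto_minutes):
    if rto_minutes >= 24 * 60:
        return "backup and restore"
    if rto_minutes >= 4 * 60:
        return "pilot light"
    if rto_minutes >= 30:
        return "warm standby"
    return "multi-site"
```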

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45ekayour8075944vd11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45ekayour8075944vd11.png" alt="DR Strategies" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How is a serverless DR different from traditional DR?
&lt;/h2&gt;

&lt;p&gt;So we learned about DR on-prem and on AWS, but what about serverless?&lt;/p&gt;

&lt;p&gt;Here’s where things get interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Trick: Pre-Deployment Without Paying for Idle&lt;/strong&gt;&lt;br&gt;
We can get the benefits of an active-active approach (like minimal RTO), but here’s the trick: with serverless, we don’t pay for most of our backup resources as long as we don’t use them! We can deploy our services ahead of time, making sure they’re ready to serve traffic immediately when needed, without paying for the time they sit idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keeping Environments in Sync with Stack Sets&lt;/strong&gt;&lt;br&gt;
We can keep both environments up to date by deploying to both of them regularly, at the same time, with CloudFormation StackSets, which let us deploy a CloudFormation stack to multiple regions and even multiple accounts!&lt;/p&gt;
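
&lt;p&gt;In boto3, adding stack instances to an existing StackSet looks roughly like this. The StackSet name, account ID, and regions are placeholders:&lt;/p&gt;

```python
# Sketch: deploy the same stack to a primary and a backup region through an
# existing CloudFormation StackSet.
def build_instance_request(stack_set_name, account_id, regions):
    """Build the kwargs for cloudformation.create_stack_instances."""
    return {
        "StackSetName": stack_set_name,
        "Accounts": [account_id],
        "Regions": regions,
    }

def deploy_to_both_regions(stack_set_name, account_id):
    import boto3  # imported lazily so the builder above is testable offline

    cfn = boto3.client("cloudformation")
    request = build_instance_request(
        stack_set_name, account_id, ["us-east-1", "us-west-2"]
    )
    return cfn.create_stack_instances(**request)["OperationId"]
```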

&lt;p&gt;Now all of our serverless components are deployed and ready for action, but we have many more resources to take care of, depending on how robust we want our solution to be.&lt;/p&gt;

&lt;p&gt;Let’s tackle them one by one, starting with the most important one — our database (DB)!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s Talk About Databases&lt;/strong&gt;&lt;br&gt;
Without a DB, we practically don’t have anything, so it’s one of the most crucial aspects (if not the most crucial) to pay attention to when designing a DR.&lt;/p&gt;

&lt;p&gt;As we’ve seen before, there are many different approaches we can take, depending on the trade-off we want between RTO and cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Availability DB Options&lt;/strong&gt;&lt;br&gt;
The options range from cross-region snapshots, through cross-region read replicas (which we promote to primary when a disaster occurs), all the way to Aurora Global Database (or DynamoDB global tables if you are using a NoSQL DB).&lt;/p&gt;

&lt;p&gt;Aurora Global Database ensures rapid recovery (an RTO of under 1 minute of downtime) with minimal data loss (an RPO of about 1 second), enabling robust business continuity.&lt;/p&gt;

&lt;p&gt;But despite all that, there is a potential loss of up to 1 second of writes because of the asynchronous replication. (If the DB itself is intact, such as when the entire region is down, the data will still be available on the original cluster as soon as it recovers.) So it’s important to keep this in mind.&lt;/p&gt;
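&lt;p&gt;A minimal boto3 sketch of wiring up an Aurora Global Database; the identifiers, engine, and regions below are placeholder assumptions:&lt;/p&gt;

```python
# Sketch: promoting an existing Aurora cluster into a Global Database and
# attaching a secondary cluster in the backup region. Names are placeholders.

def secondary_cluster_params(global_id, region):
    """Parameters for the read-only secondary cluster in the backup region."""
    return {
        "DBClusterIdentifier": global_id + "-" + region,
        "Engine": "aurora-postgresql",
        "GlobalClusterIdentifier": global_id,
    }

def create_global_db(primary_cluster_arn, global_id, dr_region):
    import boto3
    # promote the existing primary cluster into a global cluster...
    boto3.client("rds").create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=primary_cluster_arn)
    # ...then attach a secondary cluster in the backup region; Aurora handles
    # the asynchronous cross-region replication from here
    boto3.client("rds", region_name=dr_region).create_db_cluster(
        **secondary_cluster_params(global_id, dr_region))
```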

&lt;p&gt;Aurora Global DB sure is great! But even if you’re using it (or any other DB), it’s still extremely important to set up an immutable backup!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immutable Backups: Your Last Line of Defense&lt;/strong&gt;&lt;br&gt;
You can do so with AWS Backup, which supports immutability natively via Vault Lock.&lt;br&gt;
The purpose is to have a backup that no one can change or delete! So if, for example, a hacker gets into your system and encrypts or deletes all of your data, you will still have your immutable copy to recover from. (It’s recommended to store this copy in another region or even another account!)&lt;/p&gt;
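&lt;p&gt;A sketch of such an immutable vault using AWS Backup Vault Lock via boto3; the vault name and retention windows below are illustrative:&lt;/p&gt;

```python
# Sketch: an AWS Backup vault with Vault Lock enabled. Once ChangeableForDays
# elapses, the lock becomes immutable (compliance mode) and no one, not even
# the root user, can delete recovery points early. Values are placeholders.

def lock_config(vault_name, min_days=35, changeable_days=3):
    """Vault Lock settings for put_backup_vault_lock_configuration."""
    return {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_days,
        "ChangeableForDays": changeable_days,
    }

def create_locked_vault(vault_name, region="eu-west-1"):
    import boto3
    backup = boto3.client("backup", region_name=region)
    backup.create_backup_vault(BackupVaultName=vault_name)
    backup.put_backup_vault_lock_configuration(**lock_config(vault_name))
```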

&lt;p&gt;Ok, back to our Serverless DRP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Do We Know the Region Is Down?&lt;/strong&gt;&lt;br&gt;
How does our system even know that our main environment is down? We can’t use any regional service (like an ELB) for this purpose, since it will be down as well if our entire region is experiencing an outage; we have to use a global service: Route53.&lt;/p&gt;

&lt;p&gt;We can set a failover routing policy with health checks, enabling automatic failover to our backup region whenever our main environment becomes unavailable. This gives us a seamless failover mechanism (and can even trigger a CloudWatch alarm that kicks off any other actions we need to take when our main site is down).&lt;/p&gt;
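&lt;p&gt;The failover records can be sketched with boto3 roughly like this; the domain, IPs, and health check id are placeholders:&lt;/p&gt;

```python
# Sketch: a Route53 failover routing policy. The PRIMARY record answers while
# its health check passes; Route53 fails over to SECONDARY automatically.
# Domain, IPs, and the health check id are placeholders.

def failover_change_batch(domain, primary_ip, secondary_ip, health_check_id):
    """Build the ChangeBatch with a PRIMARY/SECONDARY failover record pair."""
    def record(role, ip, hc=None):
        r = {"Name": domain, "Type": "A", "SetIdentifier": role,
             "Failover": role, "TTL": 60,
             "ResourceRecords": [{"Value": ip}]}
        if hc:
            r["HealthCheckId"] = hc
        return r
    return {"Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": record("PRIMARY", primary_ip, health_check_id)},
        {"Action": "UPSERT",
         "ResourceRecordSet": record("SECONDARY", secondary_ip)},
    ]}

def apply_failover(zone_id, batch):
    import boto3
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch=batch)
```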

&lt;p&gt;&lt;strong&gt;Other Key Services: S3 and CloudFront&lt;/strong&gt;&lt;br&gt;
Ok, but what about other services, like S3 buckets or CloudFront?&lt;/p&gt;

&lt;p&gt;For an S3 bucket, it’s pretty simple: we can set up cross-region replication to our backup region, and all new objects will be replicated to the backup bucket!&lt;/p&gt;
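&lt;p&gt;A minimal sketch of that replication setup with boto3, assuming versioning is already enabled on both buckets and an IAM replication role exists (all names are placeholders):&lt;/p&gt;

```python
# Sketch: S3 cross-region replication to the backup-region bucket.
# Versioning must already be enabled on both buckets; names are placeholders.

def replication_config(role_arn, dest_bucket):
    """Replication rule sending all new objects to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "dr-replication",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # empty filter = replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::" + dest_bucket},
        }],
    }

def enable_replication(source_bucket, role_arn, dest_bucket):
    import boto3
    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket))
```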

&lt;p&gt;In a CloudFront distribution, we can configure an origin group with a failover origin, and whenever our main region becomes unavailable, CloudFront will automatically route requests to the failover origin.&lt;/p&gt;

&lt;p&gt;But serverless is not just about saving money; it’s about high availability too!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless = Built-in High Availability&lt;/strong&gt;&lt;br&gt;
When we use traditional compute services, like EC2, an instance is bound to a specific AZ, and even if just one AZ (and not the entire region) experiences an outage, our system can still go down.&lt;/p&gt;

&lt;p&gt;When using serverless, this is no longer a concern!&lt;br&gt;
Since we’re using Lambda functions together with managed services (like API Gateway integrated with SQS and SNS, a common combination in serverless architectures), we get Multi-AZ resilience natively!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve seen what a DR is and how it’s implemented on-prem, the different approaches to DR in the cloud, and finally, how DR is taken to a whole other level with serverless!&lt;/p&gt;

&lt;p&gt;DRs are now easier than ever to implement: they run automatically, and they’re cheaper too!&lt;/p&gt;

&lt;p&gt;DR is one of the most important aspects of your workflow. It’s like insurance: you can go years without anything happening, but as soon as something bad does happen, you don’t want to be caught without it.&lt;/p&gt;

&lt;p&gt;So whether you choose backup and restore, pilot light, warm standby, or multi-site, it doesn’t matter, as long as you make sure to implement one!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In the best case, a DR only adds to your monthly bill, and you won’t see it delivering value most of the time. In the worst case, it saves your company’s time, money, and reputation when a disaster occurs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Orel Bello&lt;/em&gt;&lt;br&gt;
DevOps Platform Engineer @ Melio | AWS Community Builder&lt;br&gt;
Passionate about scaling DevOps with simplicity and impact.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>platformegineering</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building Self Service… using Self Service?</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 01 Jun 2025 10:31:26 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-self-service-using-self-service-1jci</link>
      <guid>https://dev.to/aws-builders/building-self-service-using-self-service-1jci</guid>
      <description>&lt;p&gt;In Platform Engineering, our mission is clear:&lt;br&gt;
&lt;strong&gt;Build tools that help developers move fast and independently - without waiting for DevOps or becoming a bottleneck.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Melio, a team of just 5 DevOps supports over 200 developers. That scale leaves no room for manual work. Self-service isn’t a nice-to-have - it’s survival.&lt;/p&gt;

&lt;p&gt;Usually, building self-service starts by identifying repetitive requests and automating them. But what if we could take it one step further?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if we could automate the automation itself?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;One of the most common needs at Melio is creating &lt;strong&gt;Self-Service Runners&lt;/strong&gt; - little automations developers can trigger on demand.&lt;/p&gt;

&lt;p&gt;Each runner used to require a bunch of steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloning a GitHub template&lt;/li&gt;
&lt;li&gt;Customizing the SAM template&lt;/li&gt;
&lt;li&gt;Updating Lambda code to handle parameters and logic&lt;/li&gt;
&lt;li&gt;Creating a Slack modal for input&lt;/li&gt;
&lt;li&gt;Hooking it all into CircleCI and deploying&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a lot. Too much.&lt;br&gt;
So… we built a runner that builds runners.&lt;/p&gt;




&lt;p&gt;With just a few inputs - runner name, a JSON schema for inputs, and optional Terraform config - this tool does it all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spins up a GitHub repo from a template&lt;/li&gt;
&lt;li&gt;Opens a PR with all code changes&lt;/li&gt;
&lt;li&gt;Builds the Slack Modal automatically&lt;/li&gt;
&lt;li&gt;Wires it all to CI/CD and deploys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fift00nn964xv8j6wtv49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fift00nn964xv8j6wtv49.png" alt="Using the new runner" width="800" height="983"&gt;&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;Think of it like a buffet for infrastructure.&lt;br&gt;
Developers choose what they need, and automation serves it up instantly.&lt;/p&gt;

&lt;p&gt;And the best part?&lt;br&gt;
We use &lt;strong&gt;Bedrock&lt;/strong&gt; to inject parameters dynamically into Terraform files - no more writing custom logic for every use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0wflk2jqc6cutnj7lgz.png" alt="Before and After" width="800" height="533"&gt;
&lt;/h2&gt;

&lt;p&gt;We’re not just building tools.&lt;br&gt;
&lt;strong&gt;We’re building tools that build tools.&lt;/strong&gt;&lt;br&gt;
That’s what Platform Engineering looks like when AI becomes part of the stack.&lt;/p&gt;

&lt;p&gt;We didn’t reinvent the wheel. No complex systems, no massive overhead. Just practical automation that scales - fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Buffet: Behind the Scenes&lt;/strong&gt;&lt;br&gt;
We call our self-service portal The Buffet because it empowers developers to “serve themselves” — instantly and independently. Whether it’s provisioning AWS resources, spinning up a database, or managing secrets, developers just make a request and automation takes care of the rest.&lt;/p&gt;

&lt;p&gt;It's built around a simple but powerful backbone: &lt;strong&gt;Lambda, SNS, SQS, GitHub PRs, and Terraform via Env0.&lt;/strong&gt;&lt;br&gt;
It’s not flashy — but it works incredibly well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr284ro37r7yasxqb7tul.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr284ro37r7yasxqb7tul.webp" alt="Buffet Diagram" width="717" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since introducing The Buffet, we’ve offloaded hundreds of support tickets.&lt;br&gt;
DevOps interruptions are down, and developer autonomy is way up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
The result?&lt;br&gt;
With just a few clicks, we can spin up a fully functional self-service runner — production-ready and developer-friendly.&lt;/p&gt;

&lt;p&gt;It’s not just faster.&lt;br&gt;
It’s a complete shift in how we support scale.&lt;br&gt;
And it’s already boosting productivity across the board.&lt;/p&gt;

&lt;p&gt;Would love to hear how others are approaching self-service at scale. Feel free to comment or connect 🙌&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Orel Bello&lt;/em&gt;&lt;br&gt;
DevOps Platform Engineer @ Melio | AWS Community Builder&lt;br&gt;
Passionate about scaling DevOps with simplicity and impact.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>platformegineering</category>
      <category>aws</category>
      <category>ai</category>
    </item>
    <item>
      <title>Unlocking Efficiency Through Lambda-Powered Workflows</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 20 Apr 2025 12:28:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/unlocking-efficiency-through-lambda-powered-workflows-408a</link>
      <guid>https://dev.to/aws-builders/unlocking-efficiency-through-lambda-powered-workflows-408a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtn19iocnurz46ylwtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtn19iocnurz46ylwtu.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Struggling to balance support tickets and innovation? Discover how a small DevOps team leverages simple Lambda-powered workflows to empower 200+ developers and unlock massive efficiency.
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;How do you manage endless support tickets while still focusing on innovation?&lt;/p&gt;

&lt;p&gt;Not every task we handle is thrilling or exciting. Not every task is blog-post material. Sometimes, we deal with less glamorous missions, like saving money on CloudWatch log storage, offboarding a developer, securing access to a sensitive S3 bucket, disabling unused IAM roles, implementing a code freeze solution, and more.&lt;/p&gt;

&lt;p&gt;And it doesn’t stop there — sometimes, we even get support tickets that seem endless: granting missing IAM permissions, creating MongoDB or RDS clusters and users, setting up AWS Personal Accounts, creating ECR repositories, secrets, and so much more.&lt;/p&gt;

&lt;p&gt;How can just one DevOps, or even a few, manage all of this in addition to daily tasks?&lt;/p&gt;

&lt;p&gt;Here’s where it gets interesting: we’re a team of only 5 DevOps, responsible for over 200 developers.&lt;/p&gt;

&lt;p&gt;What if there were a simple way to automate these tasks or, even better, empower developers to handle them on their own?&lt;/p&gt;

&lt;p&gt;Well, what if I told you there &lt;em&gt;is&lt;/em&gt; a way? A simple way.&lt;/p&gt;

&lt;p&gt;If I were to give this blog post another title, it would be &lt;em&gt;The Power of Simplicity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You won’t find a complex architecture with thousands of lines of code here. Instead, you’ll see the most basic and straightforward solutions — the kind that are often the most effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;Before we dive in, let me introduce myself.&lt;/p&gt;

&lt;p&gt;I’m Orel Bello, an AWS Community Builder and a passionate DevOps Platform Engineer with over 3.5 years of experience, including the past 2.5 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architect, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;One thing you should know about Melio is that our entire architecture is fully serverless. We run a large-scale environment of Lambda functions, and naturally, Lambda has become our go-to solution for nearly every challenge we need to address.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Lambda Functions
&lt;/h3&gt;

&lt;p&gt;Let’s take a look at what Lambda functions are and how they can help us boost efficiency through automation.&lt;/p&gt;

&lt;p&gt;We’re all familiar with Lambda functions — the serverless compute service that lets you focus on writing code instead of managing servers.&lt;/p&gt;

&lt;p&gt;Lambda integrates natively with many AWS services, making it the perfect tool for automation.&lt;/p&gt;

&lt;p&gt;You can trigger Lambda functions on demand, or by a variety of AWS services like EventBridge, SNS, SQS, API Gateway, and many more.&lt;/p&gt;

&lt;p&gt;And the best part? You don’t need to be an expert developer to write automations. All you need is a solid understanding of basic Python and the legendary boto3 library.&lt;/p&gt;

&lt;p&gt;Boto3 is the AWS SDK for Python, built on the same engine (botocore) that powers the AWS CLI we all know and love. It lets you perform actions on AWS with ease.&lt;/p&gt;

&lt;p&gt;And here’s the kicker — it’s already included in Lambda, so no additional layer is required!&lt;/p&gt;

&lt;p&gt;So, what can you do with it?&lt;/p&gt;

&lt;p&gt;Basically — everything!&lt;/p&gt;

&lt;p&gt;Let me show you just how simple it can be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 1: Implementing a Code Freeze Solution
&lt;/h3&gt;

&lt;p&gt;Let’s talk about the Code Freeze.&lt;/p&gt;

&lt;p&gt;We always want our production environment to be stable and error-free. But there are certain critical periods, like when we’re presenting a live demo to partners, where we can’t afford the risk of a developer accidentally deploying to production and causing issues. During these times, we need to block all deployments based on a schedule automatically — and, most importantly, make it easy to enable or disable the block if a hotfix is needed in production.&lt;/p&gt;

&lt;p&gt;Here’s the simplest solution for this:&lt;/p&gt;

&lt;p&gt;Let’s break it down into three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;  — For scheduling, we can use EventBridge, which allows us to use CRON expressions to trigger our Lambda function at specific times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking&lt;/strong&gt; — Since all of our services are deployed through CloudFormation stacks, blocking all deployments is as simple as denying CloudFormation actions (such as CreateStack and UpdateStack). We can achieve this using SCPs (Service Control Policies).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt;  — This is the bridge between EventBridge and SCP.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short, we write a simple Lambda function that attaches the SCP policy and trigger it with EventBridge (and, of course, another Lambda function and EventBridge rule to disable the code freeze). It’s as easy as that!&lt;/p&gt;
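&lt;p&gt;Here is a minimal sketch of such a Lambda, assuming the deny-CloudFormation SCP already exists; the policy and target ids are hypothetical placeholders:&lt;/p&gt;

```python
# Sketch of a code-freeze Lambda: EventBridge invokes it on a schedule with a
# constant input like {"action": "freeze"}. POLICY_ID and TARGET_ID are
# hypothetical placeholders for the SCP and the OU (or account) to freeze.
POLICY_ID = "p-examplefreeze"   # the SCP that denies CreateStack/UpdateStack
TARGET_ID = "ou-root-example"   # the OU we want to freeze

def policy_change(event, policy_id, target_id):
    """Decide which Organizations call the event asks for."""
    method = "attach_policy" if event.get("action") == "freeze" else "detach_policy"
    return method, {"PolicyId": policy_id, "TargetId": target_id}

def handler(event, context):
    import boto3  # boto3 ships with the Lambda Python runtime
    org = boto3.client("organizations")
    method, kwargs = policy_change(event, POLICY_ID, TARGET_ID)
    getattr(org, method)(**kwargs)  # attach = freeze on, detach = freeze off
    return {"status": method}
```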

&lt;p&gt;Automating the code freeze mechanism not only helps safeguard stability but also simplifies the process and reduces the chances of human error during those critical times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 2: Developer Offboarding Automation
&lt;/h3&gt;

&lt;p&gt;Alright, that was simple, but what about offboarding a developer?&lt;/p&gt;

&lt;p&gt;At Melio, every developer has a Personal AWS Account and a Personal Atlas MongoDB cluster. When they leave the company, we need to delete these resources for two key reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; : We want to make sure no backdoors are left open.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Optimization&lt;/strong&gt; : Resources that are no longer in use should be terminated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don’t worry, it’s just as straightforward as before.&lt;/p&gt;

&lt;p&gt;The first step is to use EventBridge integrated with CloudTrail to capture the DisableUser event, which tells us a developer has left the company.&lt;/p&gt;

&lt;p&gt;Next, we need to clean up the AWS resources before closing the account.&lt;/p&gt;

&lt;p&gt;Why not just close the account right away? We deploy third-party resources, like the Twingate connector, when creating a personal AWS account. We’ll need to run a terraform destroy before closing the account to terminate those external resources.&lt;/p&gt;

&lt;p&gt;How do we do this?&lt;/p&gt;

&lt;p&gt;We simply send an API request (using the requests library, so we’ll need a Lambda layer for that) to Env0 (our Terraform platform). Once the destroy operation is complete (we can implement a simple wait mechanism with a step function), we close the AWS account with a basic Boto3 command. Afterward, we make an API call to MongoDB to delete the cluster, and that’s it.&lt;/p&gt;
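&lt;p&gt;As a rough sketch of the trigger and the final account-closing step (the CloudTrail event source depends on your identity setup, so treat it as a placeholder):&lt;/p&gt;

```python
# Hypothetical sketch of the offboarding trigger and the final cleanup step.
# The eventSource value below is an assumption; adjust it to whatever service
# actually emits DisableUser in your environment.

def disable_user_pattern(event_source):
    """EventBridge pattern matching the CloudTrail DisableUser call."""
    return {
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {"eventSource": [event_source], "eventName": ["DisableUser"]},
    }

def close_personal_account(account_id):
    # runs only after the Env0 terraform destroy has completed
    import boto3
    boto3.client("organizations").close_account(AccountId=account_id)
```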

&lt;p&gt;It’s an easy workflow, and aside from the additional Lambda layer for the Python requests library, everything else is native to AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 3: CloudWatch Logs Cost Optimization
&lt;/h3&gt;

&lt;p&gt;Let’s look at one more use case.&lt;/p&gt;

&lt;p&gt;At Melio, we store log groups in CloudWatch to meet compliance requirements. However, CloudWatch can be expensive, so we came up with a more cost-effective solution: exporting log groups to S3, which is a much cheaper storage option.&lt;/p&gt;

&lt;p&gt;The catch? There isn’t a native way to do this automatically, like with the lifecycle rule for S3 buckets, so we had to build our own solution.&lt;/p&gt;

&lt;p&gt;Let’s break it down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2Aiegz39AnVS0iO2XH" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2Aiegz39AnVS0iO2XH" width="720" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DynamoDB Table Creation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a DynamoDB table containing the names of all log groups. This table acts as a registry for managing the export process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Export Task Initialization:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve the last item from the DynamoDB table, initiating an export task for the corresponding log group. Subsequently, remove the item from the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Set Retention Policy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply a retention policy of 3 months to the log group that was exported successfully, ensuring that only relevant data is retained in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Task Status Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check if the DynamoDB table is empty. If it is, the export process is complete. If not, wait for 15 minutes and monitor the status of the ongoing export task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Task Completion Check:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the export task is marked as done, start the next export task. If not, wait for 15 minutes and recheck the status.&lt;/p&gt;
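&lt;p&gt;Steps 2 and 3 can be sketched with boto3 roughly as follows; the log group and bucket names are illustrative:&lt;/p&gt;

```python
import time

# Sketch: export the last ~3 months of a log group to S3, then set a 90-day
# retention policy on the group. Log group and bucket names are placeholders.

def export_task_params(log_group, bucket, days=90):
    """Arguments for logs.create_export_task covering the last `days` days."""
    now_ms = int(time.time() * 1000)
    return {
        "taskName": "export-" + log_group.strip("/").replace("/", "-"),
        "logGroupName": log_group,
        "fromTime": now_ms - days * 24 * 3600 * 1000,
        "to": now_ms,
        "destination": bucket,
        "destinationPrefix": log_group.strip("/"),
    }

def export_and_set_retention(log_group, bucket):
    import boto3
    logs = boto3.client("logs")
    logs.create_export_task(**export_task_params(log_group, bucket))
    # keep only ~3 months in CloudWatch once the export succeeds
    logs.put_retention_policy(logGroupName=log_group, retentionInDays=90)
```

&lt;p&gt;Note that CloudWatch Logs allows only one active export task per account per region, which is exactly why the DynamoDB registry and the 15-minute polling loop above are needed.&lt;/p&gt;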

&lt;p&gt;We created a systematic approach to ensure log groups are exported to S3, reducing costs while still meeting compliance. The process runs periodically — every three months — ensuring that only the necessary data stays in CloudWatch. This results in significant cost savings over time while still staying compliant with our requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Buffet: A Self-Service Solution
&lt;/h3&gt;

&lt;p&gt;While Lambda saves time through automation, how can we address on-demand developer requests without creating bottlenecks?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;The Buffet&lt;/strong&gt; comes in — a self-service portal powered by Lambda functions.&lt;/p&gt;

&lt;p&gt;The Buffet empowers developers to work more efficiently without waiting for DevOps, removing the bottleneck and allowing them to perform tasks independently. It’s all about making their lives easier and letting them do what they need to do, without any dependency on DevOps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2ACodNuGL8mSxeW9M4" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2ACodNuGL8mSxeW9M4" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;We’ve set up an interface where developers can submit their requests (we use Slack, but you can use any tool you prefer).&lt;/p&gt;

&lt;p&gt;Once a request is made, it’s sent via API Gateway into our AWS account. From there, we trigger an SNS topic, which sends the request to multiple SQS queues — one for each runner (i.e., self-service action). The relevant Lambda function pulls from the SQS queue and performs the actions.&lt;/p&gt;
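&lt;p&gt;A minimal sketch of what a runner’s Lambda looks like on the receiving end, assuming SNS raw message delivery is disabled (so each SQS body wraps an SNS envelope); the event shape shown is an assumption:&lt;/p&gt;

```python
import json

# Sketch of a runner Lambda pulling requests from its SQS queue. Each SQS
# record body wraps an SNS envelope whose "Message" field holds the
# developer's request (assuming raw message delivery is disabled).

def parse_requests(sqs_event):
    """Unwrap SNS-over-SQS records into plain request dicts."""
    return [json.loads(json.loads(record["body"])["Message"])
            for record in sqs_event["Records"]]

def handler(event, context):
    for request in parse_requests(event):
        # each runner implements its own action here, e.g. appending a new
        # ECR repository to the Terraform repo
        print("handling request:", request)
```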

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F717%2F0%2AjYK2EXhTewlUsLzr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F717%2F0%2AjYK2EXhTewlUsLzr" width="717" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Details&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That covers the infrastructure, but what about the logic for the runners?&lt;/p&gt;

&lt;p&gt;It’s simpler than you might think.&lt;/p&gt;

&lt;p&gt;We’ve identified the most frequently requested tasks and automated them. These are often day-one operations, like creating AWS personal accounts, ECR repositories, Secrets, RDS clusters, MongoDB clusters, and more.&lt;/p&gt;

&lt;p&gt;What do all of these tasks have in common?&lt;br&gt;
They all create resources using Terraform. And since the Terraform code is stored in a Git repository, we just fetch the relevant file, append the new resource, open a pull request, and after the merge, Env0 applies the changes.&lt;/p&gt;
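&lt;p&gt;A sketch of that fetch-append-PR flow using PyGithub; the repo name, file path, and branch are illustrative assumptions, not the actual code:&lt;/p&gt;

```python
# Sketch: append a Terraform resource to an existing file and open a PR with
# PyGithub. Repo name, path, and branch below are placeholders.

def append_resource(tf_content, resource_block):
    """Append a new resource block to existing Terraform file content."""
    return tf_content.rstrip() + "\n\n" + resource_block.strip() + "\n"

def open_pr(token, snippet):
    from github import Github  # PyGithub
    repo = Github(token).get_repo("my-org/infra")  # hypothetical repo
    path, branch = "ecr/main.tf", "add-ecr-repo"
    base = repo.get_branch(repo.default_branch)
    # branch off the default branch, commit the appended file, open the PR
    repo.create_git_ref(ref="refs/heads/" + branch, sha=base.commit.sha)
    current = repo.get_contents(path, ref=repo.default_branch)
    repo.update_file(path, "Add ECR repo",
                     append_resource(current.decoded_content.decode(), snippet),
                     current.sha, branch=branch)
    repo.create_pull(title="Add ECR repo", body="Created by The Buffet",
                     head=branch, base=repo.default_branch)
```

&lt;p&gt;After the merge, the Terraform platform picks up the change and applies it, so the Lambda never needs to run Terraform itself.&lt;/p&gt;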

&lt;p&gt;This simple but powerful architecture allows us to automate the creation of resources and easily add new runners without hassle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda function → Modify the Terraform repo → Create a PR → Apply with Env0.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;p&gt;Using Buffet is a win-win for everyone.&lt;/p&gt;

&lt;p&gt;Developers no longer need to wait on DevOps for support requests and can focus solely on development, free from bottlenecks. Meanwhile, DevOps can shift focus to more impactful tasks instead of handling repetitive support.&lt;/p&gt;

&lt;p&gt;Creating a Self-Service portal can significantly ease the day-to-day load on DevOps and streamline workflows for everyone.&lt;/p&gt;

&lt;p&gt;It does require effort: while building new runners is simple, creating the portal itself will take some time.&lt;/p&gt;

&lt;p&gt;However, it can empower your team and skyrocket productivity. The impact can be so huge that it’s like adding a new DevOps engineer to your team to handle the heavy lifting!&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;From a simple code freeze mechanism to comprehensive workflows, Lambda functions empower DevOps engineers to streamline their processes. Whether it’s using EventBridge for triggers, Step Functions for orchestration, or Slack for user interfaces, these tools make balancing efficiency and simplicity feel effortless.&lt;/p&gt;

&lt;p&gt;Ready to simplify your workflows? Start small — automate just one task and watch the impact it has. With every step forward, you’ll uncover the incredible power of simplicity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel8qsr3xa28icmifn1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel8qsr3xa28icmifn1t.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>awscommunitybuilder</category>
      <category>lambda</category>
      <category>serverless</category>
      <category>aws</category>
    </item>
    <item>
      <title>The journey to your first Tech Role</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 20 Apr 2025 10:15:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-journey-to-your-first-tech-role-5hc9</link>
      <guid>https://dev.to/aws-builders/the-journey-to-your-first-tech-role-5hc9</guid>
      <description>&lt;p&gt;So you’ve just finished your bachelor’s degree. Now what? With so many different fields in the industry, how can you choose what type of role to pursue? Do you know all the roles that are out there? It might be natural to try and apply to every available position out there–just to get in. And while there isn’t a single correct answer, there are some crucial aspects you’ll need to pay attention to before you apply–and choose the right role for you.&lt;/p&gt;

&lt;p&gt;First, let’s begin with my own journey. I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer.&lt;br&gt;
My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architect, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;I have a passion for assisting individuals in finding their next position, and through this blog post, I aspire to reach and help as many people as possible.&lt;/p&gt;

&lt;p&gt;Now, let’s dive in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvla27fqpzb4kasjg69x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvla27fqpzb4kasjg69x.png" alt="Illustration" width="626" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Self Confidence:
&lt;/h2&gt;

&lt;p&gt;The primary thing you need when job-seeking is self-confidence. One common sentence I often hear from individuals attempting to enter the High Tech industry is, “Why should anyone hire me? What do I have to offer?”&lt;/p&gt;

&lt;p&gt;Let HR decide if you’re a good fit for the job; don’t do it for them. Believe in yourself and take pride in your accomplishments; don’t underestimate their value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preferred Field:
&lt;/h2&gt;

&lt;p&gt;Now, it’s time to choose your preferred field. It’s okay if you don’t have one yet. Consider two essential aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What advantages do you have compared to others? Think about your unique experiences, such as self-projects, military service, or Udemy courses, which may not be traditionally defined as experience but are valuable nonetheless.&lt;/li&gt;
&lt;li&gt;Identify what you enjoy doing and what you excel at. If you can think of several fields of interest, that’s a good starting point. Make sure to research these fields thoroughly, and be ready with an answer when the interviewer asks why you want to be an X. If you don’t have a good answer, trust me, they will know.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resume:
&lt;/h2&gt;

&lt;p&gt;Create multiple versions of your CV for each position. For instance, if your technological stack includes C++, Java, Assembly, Python, and Android development, and you’re applying for a Data Scientist position, many of these skills might be irrelevant. It’s best to focus on Python and provide more detail about it, rather than adding unrelated programming languages to your resume.&lt;/p&gt;

&lt;p&gt;Apply to every related position, even if you meet only 30% of the requirements. Don’t hesitate just because you’re missing a few details; apply anyway. After the process, if they want you, you’ll be in a strong position to negotiate.&lt;/p&gt;

&lt;h2&gt;
  
  
  LinkedIn:
&lt;/h2&gt;

&lt;p&gt;One of the first challenges you will encounter will be getting an interview. So, ensure you have a proper LinkedIn profile. Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a good profile picture and a valid title (if you’re currently unemployed, you can use a future position title, like ‘Junior XX’).&lt;/li&gt;
&lt;li&gt;Connect with individuals related to your preferred fields, such as people working in companies where you’d like to work, senior programmers in your desired position, or HR professionals from various companies.&lt;/li&gt;
&lt;li&gt;Aim for at least 500 connections, but focus on making valid connections rather than adding everyone you come across.&lt;/li&gt;
&lt;li&gt;Provide details about your technical skills, technological stack, prior experience (even if it doesn’t directly relate to your preferred fields), education, courses, and certifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi4tvjs342fjek1y0bak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi4tvjs342fjek1y0bak.png" alt="Illustration" width="626" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interviews:
&lt;/h2&gt;

&lt;p&gt;Interviews can be quite intimidating at first, but with enough experience, you’ll discover that most of them follow a similar pattern, and you’ll be able to handle them almost effortlessly. Your elevator speech is crucial and should be a 60-second introduction where you persuade the interviewer to hire you. It should include a brief summary of your experience, knowledge, strengths, and what you’ll bring to the position.&lt;/p&gt;

&lt;p&gt;Prepare a project you’re proud of, whether it’s from previous roles, college, or personal time, and be ready to discuss it. Anticipate questions like ‘Why did you choose to implement it that way?’ You’ll be asked general knowledge questions, which you can practice on platforms like Glassdoor. Or, you might receive a situational question for which you’ll need to debug a problem or propose a solution.&lt;/p&gt;

&lt;p&gt;If you’re uncertain about an answer to a general knowledge question, it’s acceptable to admit it, but avoid making something up. For situational questions, it’s recommended to think out loud, saying something like “Hmm, let’s see” or “Let’s think together,” and then propose possible solutions.&lt;/p&gt;

&lt;p&gt;Gain experience through interviews. Each one will enhance your chances, knowledge, and self-confidence. Even if an interview goes poorly, view it as a learning experience to improve for the next time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve yourself:
&lt;/h2&gt;

&lt;p&gt;What can you do until you get your first position? In Israel, it may take 6 months up to a year and a half to find your first job [&lt;a href="https://www.globes.co.il/news/article.aspx?did=1001377385" rel="noopener noreferrer"&gt;Hebrew&lt;/a&gt;]. In the meantime, consider these steps to make yourself more marketable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certifications: In the IT and DevOps fields, certifications are common and can boost your resume. Look for certifications in your spare time, like AWS, to expand your knowledge and skills. Some certifications are valid for a lifetime, while others last for three years. There are different levels, such as Associate, Specialist, or Professional.&lt;/li&gt;
&lt;li&gt;Projects and hands-on experience: Work on projects and gain hands-on experience. Showcase your projects on GitHub to demonstrate your skills to potential employers.&lt;/li&gt;
&lt;li&gt;Networking: Attend meetups to learn and meet professionals from the industry. Networking can lead to new connections and opportunities in the future.&lt;/li&gt;
&lt;li&gt;Online learning: Utilize platforms like Udemy for affordable or free courses to gain relevant skills.&lt;/li&gt;
&lt;li&gt;Consider an entry-level position: Starting in an entry-level position related to your preferred field can be beneficial. It might not be your dream job, but the experience gained can set you on the right career path. For instance, if you aim to be a DevOps engineer, positions like Automation or IT (even a help-desk role) can be a stepping stone. However, if your goal is to become a Data scientist, starting as QA might not be so beneficial.&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Consider enrolling in a bootcamp, but be aware of the two kinds available:&lt;/u&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paid bootcamps:&lt;/strong&gt; These can be expensive, costing a few thousand dollars or even more (up to 20,000 NIS). They do not guarantee a job at the end of the course, but you can quit without an additional fee (although you won’t get a refund for the course fee).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company-backed bootcamps:&lt;/strong&gt; Some bootcamps are free and offer guaranteed job placement at their company or other partnering firms, but this is usually limited to exceptional students. If you join, be prepared to work at the company for 2–3 years, often at a lower salary compared to other companies. Quitting early may result in a significant penalty (up to 90,000 NIS).&lt;/li&gt;
&lt;li&gt;As a recommendation: bootcamps are optional, and their value varies by field of interest. They can be beneficial for those without prior experience or relevant education (like a bachelor’s degree). However, if you have the self-discipline to learn independently, it might be better to treat a bootcamp as a last resort.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Never give up
&lt;/h2&gt;

&lt;p&gt;Securing your first tech role may not always be easy. The key is to never give up: keep trying and put in effort every single day. Stay committed to the process, and success will come eventually. Best of luck!&lt;/p&gt;

</description>
      <category>career</category>
      <category>development</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Pay Less For Serverless: Practical Tips</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Mon, 21 Oct 2024 11:18:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/pay-less-for-serverless-practical-tips-jcg</link>
      <guid>https://dev.to/aws-builders/pay-less-for-serverless-practical-tips-jcg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0jcsj1g9wd37ij4uh4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0jcsj1g9wd37ij4uh4l.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all know the benefits of using serverless architecture, the concept is pretty simple: we pay AWS for managing the infrastructure for us so that we can focus solely on developing, instead of handling and maintaining the servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about the costs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a small environment with infrequent access, the serverless architecture can actually save you money — for example, when you don’t have traffic, the environment scales to zero and you don’t pay at all.&lt;/p&gt;

&lt;p&gt;But in a large environment, such as ours at Melio (where all of our architecture is serverless), the price can spike and reach over $100K monthly on the Lambda functions alone, so what can we do to optimize it?&lt;/p&gt;

&lt;p&gt;The first thing we need to do is to determine which services will be used in a serverless architecture, and then we can see how to optimize them.&lt;/p&gt;

&lt;p&gt;This blog post will explore the various strategies for cost optimization in a serverless architecture, focusing on services and best practices to ensure efficient spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who am I and why do I care about cloud costs?
&lt;/h3&gt;

&lt;p&gt;My name is Orel Bello, and for the last two years, I’ve been working as a DevOps Engineer at Melio. I’m an AWS Certified Solutions Architect Professional and Melio’s focal point for FinOps.&lt;/p&gt;

&lt;p&gt;Since I started using AWS, I have paid attention to the price of every resource, as pricing is a big part of the AWS Certified Solutions Architect Associate certification I earned at the beginning of my cloud journey, so I knew we had a lot to cut.&lt;/p&gt;

&lt;p&gt;Recently, Melio started the enrollment process for the AWS EDP (Enterprise Discount Program), which requires cost optimization beforehand, so let’s start saving money:&lt;/p&gt;

&lt;h4&gt;
  
  
  Lambda Pricing
&lt;/h4&gt;

&lt;p&gt;Before optimizing Lambda costs, it’s important to understand the pricing model.&lt;/p&gt;

&lt;p&gt;You are charged based on execution time (measured in milliseconds) and the amount of memory allocated.&lt;/p&gt;

&lt;p&gt;For example, a function with 128MB of memory (which costs $0.0000000021 per millisecond) and an execution time of 3 seconds would cost ($0.0000000021 * 3000 =) $0.0000063 per invocation.&lt;/p&gt;

&lt;p&gt;If you double the memory and halve the execution time, the cost will remain roughly the same. However, the performance improvement might vary depending on the task.&lt;/p&gt;

&lt;p&gt;Remember, each Lambda execution environment handles only one request at a time. Therefore, more requests lead to more invocations, which increases costs.&lt;/p&gt;
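&lt;p&gt;To make the math concrete, here’s a small sketch of that pricing formula. The 128MB per-millisecond rate is the one quoted above, and the helper name is illustrative; always check the current AWS pricing page for your region:&lt;/p&gt;

```python
def lambda_invocation_cost(duration_ms, memory_mb, price_per_ms_128mb=0.0000000021):
    """Cost of one invocation: the per-ms rate scales linearly with memory."""
    rate = price_per_ms_128mb * (memory_mb / 128)
    return duration_ms * rate

# 128MB running for 3 seconds, as in the example above:
cost = lambda_invocation_cost(3000, 128)   # 0.0000063
# Double the memory and halve the duration: roughly the same price
same = lambda_invocation_cost(1500, 256)
```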

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Introducing AWS Lambda Power Tuning:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So you just created a new Lambda function, how do you choose how much RAM you need to allocate? (While you can’t directly adjust vCPU values, increasing RAM indirectly enhances vCPU performance too.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/alexcasalboni/aws-lambda-power-tuning" rel="noopener noreferrer"&gt;This open-source tool&lt;/a&gt; can help you optimize your Lambda function and suggest the best power configuration to minimize cost and/or maximize performance.&lt;/p&gt;

&lt;p&gt;It will run your function on a benchmark, suggesting the best values for RAM, and will also show the average execution time.&lt;/p&gt;

&lt;p&gt;So by increasing RAM based on the results, you can make your Lambda function run faster, and you’ll pay less (or at least the same) because the execution time is reduced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ajv0wCmKdAuuYSZOy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ajv0wCmKdAuuYSZOy" width="1024" height="725"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Why not just set the timeout to the max value?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Setting the timeout to the maximum can be costly because you are charged for every millisecond your Lambda function runs. If an error occurs and the function simply waits for the timeout (for example, when you’re accessing an unresponsive API), you will incur unnecessary charges. Therefore, it’s crucial to set the timeout to fit your specific needs.&lt;/p&gt;

&lt;p&gt;To determine the correct timeout value for your Lambda function, you can use CloudWatch metrics or the Lambda Power tool. These tools provide the average execution time, allowing you to add a buffer for safety and set an appropriate timeout value.&lt;/p&gt;
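&lt;p&gt;As a rough sketch: once you’ve pulled recent execution times (for example, from the &lt;strong&gt;Duration&lt;/strong&gt; metric in CloudWatch), a sensible timeout is just the worst observed duration plus a safety buffer. The helper name and sample values below are illustrative:&lt;/p&gt;

```python
import math

def suggest_timeout_seconds(durations_ms, buffer=1.5):
    """Max observed duration times a safety buffer, rounded up to whole seconds."""
    return math.ceil(max(durations_ms) * buffer / 1000)

# Sample durations (ms), as you might read them from CloudWatch:
suggest_timeout_seconds([1200, 800, 950])   # 2 seconds, not the 900s maximum
```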

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEpSLjE3PoGWUI6tw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEpSLjE3PoGWUI6tw" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Don’t put all your code inside the Lambda handler&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lambda functions operate within a virtual environment that persists across invocations, known as a microVM. However, it’s crucial to note that the main function code (the handler) is executed fresh each time it’s called. If you set up resources like a database connection within the handler, they are recreated with every call, slowing performance and potentially increasing costs.&lt;/p&gt;

&lt;p&gt;To improve performance and cut expenses, it’s best practice to set up lasting resources, such as database connections, outside the handler. This enables subsequent invocations to reuse these established resources, leading to quicker execution and savings.&lt;/p&gt;
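&lt;p&gt;A minimal sketch of the pattern (here the standard library’s sqlite3 stands in for any real connection: an RDS client, Redis, a boto3 client, and so on):&lt;/p&gt;

```python
import sqlite3

# Created once per execution environment (cold start);
# warm invocations reuse the same connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS hits (n INTEGER)")

def handler(event, context):
    # Only per-request work belongs here: no connection setup.
    conn.execute("INSERT INTO hits (n) VALUES (1)")
    return conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
```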

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F498%2F0%2AyXrNiobl3F2LPCgy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F498%2F0%2AyXrNiobl3F2LPCgy" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Migrate to ARM-based AWS Graviton processor:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using ARM architecture with Graviton processors instead of x86 processors can reduce the overall cost of your Lambda function by up to 20% while improving performance by 19%!&lt;/p&gt;

&lt;p&gt;The migration itself is pretty simple: unless you have dependencies or libraries compiled for x86, you don’t need to take any further steps when migrating to the Graviton processor.&lt;/p&gt;

&lt;p&gt;Of course, it’s always best practice to run tests on Dev environments first before making changes on Production, but the transition itself should be pretty seamless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AN2g8tCGJvI8Sbsfr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AN2g8tCGJvI8Sbsfr" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Provisioned Concurrency — Don’t use it recklessly!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Provisioned Concurrency keeps your Lambda functions ‘warm’ and ready for action, making them execute faster by eliminating cold starts and improving performance.&lt;/p&gt;

&lt;p&gt;Keep in mind that you’re billed based on the number of provisioned concurrency units and the duration they’re active; used recklessly, it can become very expensive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F890%2F0%2AF677c9yaDu9j8CDA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F890%2F0%2AF677c9yaDu9j8CDA" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what do you need to do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Provisioned Concurrency only for production workloads with user-facing functions, and avoid using it in development environments.&lt;/li&gt;
&lt;li&gt;Provision the minimum amount of concurrency your function will need, by analyzing application traffic patterns and performance requirements (you can use the CloudWatch &lt;strong&gt;&lt;em&gt;ProvisionedConcurrencyUtilization&lt;/em&gt;&lt;/strong&gt; metric for that). Remember that over-provisioning will just cause extra costs.&lt;/li&gt;
&lt;li&gt;Use the auto-scaling feature of Provisioned Concurrency to gradually scale your function based on utilization, ensuring you avoid over-provisioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, functions with shorter execution times require less Provisioned Concurrency, so if you optimize your code and RAM configuration, and lower your execution time, you can also save money on the Provisioned Concurrency.&lt;/p&gt;
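&lt;p&gt;A quick back-of-the-envelope for sizing: by Little’s law, the concurrency a function needs is roughly requests per second times average duration in seconds, which is exactly why shorter execution times mean less Provisioned Concurrency to buy. A hypothetical helper:&lt;/p&gt;

```python
import math

def needed_concurrency(requests_per_second, avg_duration_ms):
    """Little's law estimate: concurrent executions = arrival rate * duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)

needed_concurrency(100, 250)   # 25 concurrent executions
needed_concurrency(100, 125)   # 13: halving duration nearly halves it
```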

&lt;p&gt;Remember: A serverless environment will not cost you money when there is no traffic, but you will pay for the provisioned concurrency! So even if you have an inactive environment, you must take it into account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Don’t Use Sleep:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Did you ever need to wait for an operation that is running outside the Lambda function to finish? Did you use ‘sleep’ while you wait?&lt;/p&gt;

&lt;p&gt;For those of you who aren’t familiar with the sleep method, it’s pretty straightforward: you specify the amount of time you want the function to pause while it waits for the external operation to finish.&lt;/p&gt;

&lt;p&gt;So why is it bad practice to use it inside a Lambda function?&lt;br&gt;&lt;br&gt;
As you may already guess, it’s because we pay for the time that the Lambda function is waiting for.&lt;/p&gt;

&lt;p&gt;So what can we do instead of using sleep?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Introducing Step Functions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step Functions is a serverless orchestration service that integrates natively with Lambda and many other services, and lets you model a workflow as a state machine.&lt;/p&gt;

&lt;p&gt;This lets us divide a large Lambda function that needs to wait for an I/O operation into smaller functions, with the wait-and-check logic between them living outside our Lambda functions, so we don’t pay for a function while it’s waiting!&lt;/p&gt;

&lt;p&gt;So if the wait is free on Step Functions, what is the pricing?&lt;/p&gt;

&lt;p&gt;We pay per state transition.&lt;br&gt;&lt;br&gt;
Let’s take a look at a common use case:&lt;/p&gt;

&lt;p&gt;We triggered an operation from the Lambda function, and set a loop to check when it’s done, with a ‘WAIT’ between each check.&lt;/p&gt;

&lt;p&gt;If we want to save costs, we can define the waiting time with a greater value, which will lower the number of transitions and reduce the overall cost.&lt;/p&gt;

&lt;p&gt;For a small state machine this is pretty insignificant, but at a large scale it can get expensive.&lt;/p&gt;
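&lt;p&gt;The polling pattern above can be sketched as an Amazon States Language definition, shown here as a Python dict (the state names and Lambda ARNs are placeholders). The Wait state itself is free, so raising its Seconds value reduces the number of paid transitions:&lt;/p&gt;

```python
import json

definition = {
    "StartAt": "StartJob",
    "States": {
        "StartJob": {"Type": "Task",
                     "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:start-job",
                     "Next": "WaitForJob"},
        # The Wait state costs nothing while it waits; a larger Seconds
        # value means fewer loop iterations, hence fewer paid transitions.
        "WaitForJob": {"Type": "Wait", "Seconds": 60, "Next": "CheckStatus"},
        "CheckStatus": {"Type": "Task",
                        "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:check-status",
                        "Next": "IsDone"},
        "IsDone": {"Type": "Choice",
                   "Choices": [{"Variable": "$.status",
                                "StringEquals": "DONE",
                                "Next": "Done"}],
                   "Default": "WaitForJob"},
        "Done": {"Type": "Succeed"},
    },
}

# This is what you would pass to Step Functions as the state machine definition:
state_machine_json = json.dumps(definition)
```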

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F791%2F0%2Au9M3-tJWX7BCoGQI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F791%2F0%2Au9M3-tJWX7BCoGQI" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Compute Saving Plan:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So what is the AWS Compute Savings Plan?&lt;/p&gt;

&lt;p&gt;You basically commit to using AWS Lambda for the next 1–3 years, and in exchange, you get a discount of up to 17% (the Compute Savings Plan also applies to EC2 and Fargate, where the discount can reach 66%).&lt;/p&gt;

&lt;p&gt;The pricing model of a Savings Plan is more flexible than RIs (Reserved Instances), as you aren’t bound to use a specific instance type or a specific region.&lt;/p&gt;

&lt;p&gt;If you’re afraid of the commitment, you can always choose the most basic option: 1 year with no upfront payment. If you’re working at a steady pace with solid Lambda usage, using Savings Plans should be a no-brainer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiS_9vpDUf_VymyLr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiS_9vpDUf_VymyLr" width="1024" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Logs — storing is cheap, writing is EXPENSIVE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logs are crucial, we just can’t live without them.&lt;/p&gt;

&lt;p&gt;But, do we really need all of our logs? There are a few log levels, such as DEBUG, INFO, WARN, ERROR, and FATAL, listed from most verbose to most severe.&lt;/p&gt;

&lt;p&gt;Do we really need to write them at such a high frequency? Is every INFO message really needed?&lt;/p&gt;

&lt;p&gt;Also, if we’re using a third-party monitoring tool, which itself costs a lot, do we really need to write the logs to CloudWatch as well?&lt;/p&gt;

&lt;p&gt;We need to understand that nothing is free and writing logs costs money, and with some work, we can save a lot of money!&lt;/p&gt;

&lt;p&gt;So what can you do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure that only crucial logs are written. (You can do so by utilizing Monolog’s &lt;a href="https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/FingersCrossedHandler.php" rel="noopener noreferrer"&gt;FingersCrossedHandler&lt;/a&gt;, which buffers logs and writes them only when errors occur.)&lt;/li&gt;
&lt;li&gt;Add a retention policy to delete the logs once they’re no longer needed (or archive them in S3 Glacier).&lt;/li&gt;
&lt;li&gt;When applicable, consider using the new &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-log-class-for-infrequent-access-logs-at-a-reduced-price/" rel="noopener noreferrer"&gt;Infrequent Access tier on CloudWatch&lt;/a&gt;, which can save you up to 50% on log group costs. (Note that it doesn’t fit every use case, as it doesn’t support real-time monitoring, metric filters, subscription filters, or log anomaly detection.)&lt;/li&gt;
&lt;/ul&gt;
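&lt;p&gt;For the retention point, here’s a small sketch of finding the log groups that still keep logs forever. The dicts mirror the shape CloudWatch Logs’ describe_log_groups returns, where &lt;em&gt;retentionInDays&lt;/em&gt; is simply absent when no retention is set; the group names are made up:&lt;/p&gt;

```python
def groups_missing_retention(log_groups):
    """Names of log groups with no retention policy (kept forever)."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]

# Entries as returned by describe_log_groups (hypothetical names):
groups = [
    {"logGroupName": "/aws/lambda/orders"},                          # kept forever
    {"logGroupName": "/aws/lambda/payments", "retentionInDays": 90},
]
groups_missing_retention(groups)   # ["/aws/lambda/orders"]
```

You could then apply a policy to each of those with the put_retention_policy API call.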

&lt;p&gt;&lt;strong&gt;10. VPC Endpoints (VPCE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This awesome feature is not unique to serverless architectures, but it’s a must-have!&lt;/p&gt;

&lt;p&gt;Basically, instead of leaving your VPC to reach AWS services via the NAT Gateway, which isn’t cheap, you can use AWS’s backbone network to connect to those services directly from your VPC, without traversing the public internet.&lt;/p&gt;

&lt;p&gt;This solution is more secure, efficient, and cost-effective, and you can use it with different services, such as S3, DynamoDB, ECR, EC2, Lambda, KMS, SSM and so on.&lt;/p&gt;

&lt;p&gt;This simple yet powerful feature can reduce your data processing costs and save you some money.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F768%2F0%2A50Gy7yY4loR_KI_N" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F768%2F0%2A50Gy7yY4loR_KI_N" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Cost optimization in a serverless environment is (almost) all about the Lambda functions.&lt;/p&gt;

&lt;p&gt;There is no doubt that this kind of cost optimization requires more effort, from both DevOps and the developers, and there isn’t much low-hanging fruit. But once you define guidelines in your organization and enforce them, you will be able to save a lot of money.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fank0vsexuuggy8hzqona.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fank0vsexuuggy8hzqona.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>melioengineering</category>
      <category>aws</category>
      <category>lambda</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How did we reduce our monthly AWS bills by 20% without breaking a sweat?</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Wed, 05 Jun 2024 12:49:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-did-we-reduce-our-monthly-aws-bills-by-20-without-breaking-a-sweat-1kjm</link>
      <guid>https://dev.to/aws-builders/how-did-we-reduce-our-monthly-aws-bills-by-20-without-breaking-a-sweat-1kjm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed2h195sl3ig1yuyuwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed2h195sl3ig1yuyuwk.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of my many tasks as a DevOps engineer in Melio was to reduce our cloud cost.&lt;/p&gt;

&lt;p&gt;Ok…it wasn’t my task, but I made it mine.&lt;/p&gt;

&lt;p&gt;I saw the enormous price we paid every month and I just couldn’t stand by; I wanted to do something about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who am I and why do I care about cloud costs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My name is Orel Bello, and for the last year, I’ve been working as a DevOps Engineer on the SRE (Site Reliability Engineering) team at Melio. I started as a Deputy Commander in the Technological Control Center of the Israel Police as part of my military service. I then completed my B.Sc. in Computer Science and started working as a Storage and Virtualization Engineer. After a year and a half, I realized that I wanted to be a DevOps engineer, and I got my first DevOps position right before I started at Melio.&lt;/p&gt;

&lt;p&gt;Since I started using AWS, I have paid attention to the price of every resource, as pricing is a big part of the AWS Solutions Architect Associate certification I earned at the beginning of my cloud journey, so I knew we had a lot to cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what’s the challenge?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we faced a lot of challenging and more urgent tasks in our day-to-day work, reducing our cloud cost wasn’t a priority. I had to find a way to do it with minimum effort and without the help of the R&amp;amp;D.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffveccnlffsir7dy3y8py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffveccnlffsir7dy3y8py.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started…&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I started to dig into our bills, and I saw many different metrics, but I didn’t know what they meant.&lt;/p&gt;

&lt;p&gt;One thing caught my eye: CloudWatch’s cost was high (about $20,000 monthly).&lt;/p&gt;

&lt;p&gt;After a little research, I discovered that we don’t have a retention policy for our log groups, so we keep them forever.&lt;/p&gt;

&lt;p&gt;I wanted to set a lifecycle policy (similar to the one S3 has natively) to set the retention to 3 months and then export the log groups to an S3 bucket for archiving, since it’s a much cheaper storage solution. However, I was amazed to see that there was no built-in automated option for it, so I had to build one of my own (using Step Functions and Lambdas; it was really fun to build).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Melio, we store log groups in CloudWatch to meet compliance requirements. However, due to the high costs associated with CloudWatch, we devised a cost-effective solution: exporting log groups to a more economical storage option — S3 buckets.&lt;/p&gt;

&lt;p&gt;We implemented a custom solution to automate this export process using AWS Step Functions triggered by an event bus. Here’s a breakdown of the process, which occurs every three months:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB Table Creation:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a DynamoDB table containing the names of all log groups. This table acts as a registry for managing the export process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Export Task Initialization:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve the last item from the DynamoDB table, initiating an export task for the corresponding log group. Subsequently, remove the item from the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Set Retention Policy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply a retention policy of 3 months to the log group that was exported successfully, ensuring that only relevant data is retained in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Task Status Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check if the DynamoDB table is empty. If it is, the export process is complete. If not, wait for 15 minutes and monitor the status of the ongoing export task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Task Completion Check:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the export task is marked as done, start the next export task. If not, wait for 15 minutes and recheck the status.&lt;/p&gt;

&lt;p&gt;This systematic approach ensures that log groups are exported to S3, reducing costs while adhering to compliance requirements. The periodic execution every three months guarantees that only necessary data remains in CloudWatch, contributing to significant cost savings over time.&lt;/p&gt;
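&lt;p&gt;Steps 2–5 boil down to a simple drain loop. Here’s a minimal, testable sketch with the AWS calls injected as plain functions; in production they would wrap CloudWatch Logs’ create_export_task and describe_export_tasks, the registry would be the DynamoDB table, and the waiting would be a Step Functions Wait state rather than an in-process loop:&lt;/p&gt;

```python
def drain_exports(pending, start_export, get_status):
    """pending: list of log group names (stand-in for the DynamoDB registry)."""
    finished = []
    while pending:                       # step 4: an empty registry means done
        group = pending.pop()            # step 2: take the last item
        task_id = start_export(group)    # step 2: begin the export task
        while get_status(task_id) != "COMPLETED":
            pass                         # steps 4-5: really a 15-minute wait
        finished.append(group)           # step 3: retention gets applied here
    return finished

# With fake AWS calls that complete immediately:
drain_exports(["groupA", "groupB"], lambda g: g, lambda t: "COMPLETED")
# ["groupB", "groupA"]
```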

&lt;p&gt;After a month or two, I noticed the costs were decreasing less than anticipated. In addition to our custom solution effectively managing data retention and export, diving into CloudWatch metrics revealed another key expense: Ingested data cost.&lt;/p&gt;

&lt;p&gt;While this solution remains beneficial for those with substantial expenses on CloudWatch log groups, I felt the need to delve deeper and explore additional avenues for savings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2As3EAWfcJ1fUpAxdI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2As3EAWfcJ1fUpAxdI" width="1024" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudwatch: the big money lies in writing, not in storing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I dove deep into our billing metrics and saw that the price of Ingested-Data (writing to the log groups) makes up most of our CloudWatch cost, while our Stored-Bytes (the storage of the log groups) was pretty low, so I had to change tactics.&lt;/p&gt;

&lt;p&gt;I found out that we have three log groups that produce so many logs that each one costs more than $1,500 monthly! Luckily, these log groups are pretty common, so you may be able to benefit from this too.&lt;/p&gt;

&lt;p&gt;The first one was VPC Flow Logs (which record all the traffic entering and leaving the VPC, useful for security and debugging purposes), which we simply modified to write logs to an S3 bucket instead of CloudWatch (if you don’t need them, you can just disable them). Doing that saved us &lt;strong&gt;$1,500&lt;/strong&gt; monthly!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudTrail, when not properly configured, is REALLY expensive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, there was the CloudTrail log group. CloudTrail is a useful (and expensive) AWS service that records every API action performed inside the AWS account.&lt;/p&gt;

&lt;p&gt;We had two separate CloudTrail log groups that we simply disabled and deleted (we didn’t even need them, since the same events were already saved in S3 and visible in the CloudTrail console).&lt;/p&gt;

&lt;p&gt;And just like that, we saved &lt;strong&gt;another $4,000&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;After I saw how expensive the CloudTrail log groups were, I decided to take another look at them. I found out that we had a duplicate trail, so we were paying extra; I just didn’t know how much extra. Disabling the additional trail saved &lt;strong&gt;$27,000&lt;/strong&gt; per month! We went from paying &lt;strong&gt;$30,000&lt;/strong&gt; monthly to only &lt;strong&gt;$3,000&lt;/strong&gt; monthly.&lt;/p&gt;
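&lt;p&gt;A simple way to spot this kind of duplication is to list your trails and flag any extra multi-region ones. The sketch below mimics the shape of CloudTrail’s DescribeTrails output, but the detection rule is a simplified assumption of mine, not an official AWS check:&lt;/p&gt;

```python
# Sketch: flag CloudTrail trails that look redundant because more than one
# multi-region trail is recording the same management events. The dicts mimic
# fields returned by CloudTrail's DescribeTrails API; the "first one wins"
# rule is a simplifying assumption - always review before deleting anything.

def find_redundant_trails(trails):
    multi_region = [t for t in trails if t.get("IsMultiRegionTrail")]
    # One multi-region management-event trail is usually enough;
    # everything beyond the first is a candidate for removal.
    return [t["Name"] for t in multi_region[1:]]

trails = [
    {"Name": "org-trail", "IsMultiRegionTrail": True},
    {"Name": "legacy-trail", "IsMultiRegionTrail": True},  # duplicate coverage
    {"Name": "s3-data-events", "IsMultiRegionTrail": False},
]
print(find_redundant_trails(trails))  # candidates to review, not auto-delete
```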

&lt;p&gt;&lt;strong&gt;RIs and Savings Plans — the first steps toward cost optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common and simple ways to save costs is by purchasing Reserved Instances (RIs) and Savings Plans.&lt;/p&gt;

&lt;p&gt;RIs and Savings Plans are similar, but with some key differences:&lt;/p&gt;

&lt;p&gt;RIs are tied to a specific instance type in a specific region, so if you switch to a different region or instance class mid-year, you will still be paying for the RIs you bought and are no longer using. Savings Plans, on the other hand, give you the flexibility to switch between instance families, sizes, and operating systems within the same region. Both require a commitment of 1–3 years.&lt;/p&gt;

&lt;p&gt;We already had a Compute Savings Plan, which saved around &lt;strong&gt;$8,000&lt;/strong&gt; per month (it applies to EC2, ECS, and Lambda; our architecture is mostly serverless, so it fit our needs well). I then purchased RIs for RDS (Relational Database Service) with the most basic plan (a 1-year commitment with no upfront cost, so there is no reason not to use it!), which saved us another &lt;strong&gt;$10,500&lt;/strong&gt; per month.&lt;/p&gt;
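&lt;p&gt;The arithmetic behind such a purchase is simple. The ~30% discount below is a hypothetical figure; actual RI rates vary by engine, instance class, region, and commitment terms:&lt;/p&gt;

```python
# Rough savings estimate for a 1-year, no-upfront RDS RI. The 30% discount
# is a hypothetical assumption - check the RDS pricing page for the real
# rate on your engine, instance class, and region before committing.

def ri_monthly_savings(on_demand_monthly, discount=0.30):
    """Estimated monthly saving from covering an on-demand bill with RIs."""
    return on_demand_monthly * discount

# e.g. a hypothetical $35,000/month on-demand RDS bill at a 30% discount:
print(f"${ri_monthly_savings(35000):,.0f} saved per month")
```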

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvic8w04a54gmsp0l4vx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvic8w04a54gmsp0l4vx9.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep an eye out for unknown bills — you might be surprised&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, I saw an odd bill for a new service called Security Lake. It was costly (around &lt;strong&gt;$10,000&lt;/strong&gt; per month), so I checked with the relevant team. The service didn’t provide enough value to justify its price tag, so we disabled it and saved another &lt;strong&gt;$10,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the first phase of reducing our cloud costs. The rest of the savings won’t be as easy to achieve, but will be worth it!&lt;/p&gt;

&lt;p&gt;Remember that cost optimization is all about monitoring. You should check each month that you don’t see unfamiliar bills or anomalies, and work constantly to reduce extra costs.&lt;/p&gt;
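&lt;p&gt;That monthly check can be as simple as comparing each service’s cost to the previous month and flagging big jumps. A minimal sketch, with hard-coded numbers standing in for data you would normally pull from Cost Explorer:&lt;/p&gt;

```python
# Minimal sketch of the monthly check described above: compare each service's
# cost to the previous month and flag jumps above a threshold. Real setups
# would pull these figures from Cost Explorer; here they are hard-coded.

def flag_anomalies(prev, curr, threshold=0.25):
    """Return services whose cost grew more than `threshold` month-over-month."""
    flagged = []
    for service, cost in curr.items():
        baseline = prev.get(service, 0.0)
        # A brand-new service (no baseline) is always worth a look.
        if baseline == 0.0 or (cost - baseline) / baseline > threshold:
            flagged.append(service)
    return sorted(flagged)

prev = {"CloudWatch": 3000, "RDS": 24000}
curr = {"CloudWatch": 3100, "RDS": 24500, "SecurityLake": 10000}  # new bill!
print(flag_anomalies(prev, curr))
```

This is exactly how the Security Lake bill above would have surfaced: a service with no prior baseline suddenly appearing in the monthly numbers.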

&lt;p&gt;First, you need to pinpoint your most expensive services, prioritizing quality over quantity. It’s important to choose your battles wisely: you can’t optimize all of your costs (OK, you can, but some of them are not worth the trouble, so make sure to focus on the most impactful ones).&lt;/p&gt;

&lt;p&gt;It’s very satisfying to help make a difference with so little effort. I encourage you to try it yourself. Saving money for your organization can impact its growth, and you can take some of the credit for it. 🙂&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvde7tigwoprqeh7brfka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvde7tigwoprqeh7brfka.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>costoptimization</category>
      <category>finops</category>
      <category>cloudwatch</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
