<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Orel Bello</title>
    <description>The latest articles on DEV Community by Orel Bello (@orelbello).</description>
    <link>https://dev.to/orelbello</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3033456%2F2d285d3b-63b7-4312-b2af-52de36dba934.jpeg</url>
      <title>DEV Community: Orel Bello</title>
      <link>https://dev.to/orelbello</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/orelbello"/>
    <language>en</language>
    <item>
      <title>The Hard Truth About Platform Engineering Adoption</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Mon, 23 Feb 2026 07:13:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-hard-truth-about-platform-engineering-adoption-3p47</link>
      <guid>https://dev.to/aws-builders/the-hard-truth-about-platform-engineering-adoption-3p47</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;You know how it is. There is always this ancient struggle between doing things the fast way or the right way. Most of the time, the right way is slower, but we still have to deliver, and fast.&lt;/p&gt;

&lt;p&gt;Platform Engineering adoption doesn't fail because of tooling. &lt;br&gt;
It fails because of habits.&lt;/p&gt;

&lt;p&gt;In a world where every request is critical and urgent (I'll never forget the developer who opened a ticket with a severity of "Production is down," saying that his personal AWS account didn't work, and it was Production for him), the struggle is real between fixing it manually because it's urgent, or investing time in building automation that will do it the right way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what would you do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're a small DevOps team, responsible for more than 200 engineers. Naturally, we started handling requests manually, tying up loose ends and eliminating blockers. &lt;br&gt;
After all, nobody wants to be the one slowing everyone down.&lt;/p&gt;

&lt;p&gt;But if you continue doing things manually just to keep up with urgent requests, in the long run you'll slow the entire company down. What works for a company of 50 engineers doesn't always scale to 200 engineers.&lt;/p&gt;

&lt;p&gt;That's where Platform Engineering comes into play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Platform Engineering Actually Means
&lt;/h2&gt;

&lt;p&gt;The Platform Engineering concept is pretty simple. Instead of handing developers a fish, we give them a fishing rod.&lt;/p&gt;

&lt;p&gt;If a traditional DevOps team builds infrastructure for developers, here we give them the tools to do it themselves.&lt;/p&gt;

&lt;p&gt;More importantly, Platform Engineering is about creating a default way of working. &lt;br&gt;
Not just tools that developers can use, but paths they're expected to use.&lt;/p&gt;

&lt;p&gt;The goal is to eliminate bottlenecks and accelerate innovation. If the DevOps team can't take a day off without the company going up in flames, something probably needs to change.&lt;/p&gt;

&lt;p&gt;We need to adopt an enablement mindset. How can we give developers tools to work independently, while still keeping the organization's best practices?&lt;/p&gt;

&lt;p&gt;Remember, the wisdom is to find ways to allow, not to block. Although, as we will see later on, sometimes blocking is inevitable.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Steps
&lt;/h2&gt;

&lt;p&gt;The first thing we did was build a self-service platform, which we call the Buffet.&lt;/p&gt;

&lt;p&gt;Now, whenever developers need to create a new MySQL user, MongoDB cluster, or even a secret (and many more different resources), they can do it completely automatically, without waiting for DevOps, by using the Buffet (which we implemented as a Slack bot).&lt;/p&gt;
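
&lt;p&gt;To make this concrete, here's a minimal sketch of how a Buffet-style request could be routed. The supported resource types and handler names are hypothetical, not the actual bot's implementation:&lt;/p&gt;

```python
# Hypothetical routing for a self-service Slack bot: validate the request,
# then hand it off to the matching automation handler.
SUPPORTED_RESOURCES = {
    "mysql_user": "create_mysql_user",
    "mongodb_cluster": "create_mongodb_cluster",
    "secret": "create_secret",
}

def route_request(resource_type, payload):
    """Validate a self-service request and return the handler to invoke."""
    if resource_type not in SUPPORTED_RESOURCES:
        raise ValueError(f"Unsupported resource type: {resource_type}")
    if not payload.get("requester"):
        raise ValueError("Every request must carry a requester for auditing")
    return SUPPORTED_RESOURCES[resource_type]
```

&lt;p&gt;The important part is the shape: every request is validated and audited the same way, no matter which resource it creates.&lt;/p&gt;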

&lt;p&gt;It's a win-win. Developers move much faster without waiting for DevOps, and DevOps has more capacity to focus on real work instead of manual support and acting as a help desk.&lt;/p&gt;

&lt;p&gt;But is that all? No.&lt;/p&gt;

&lt;p&gt;Self-service solves symptoms. It doesn't solve standardization.&lt;br&gt;
Platform Engineering doesn't end here, and this is also where the real problems usually start.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenges We Didn't Expect
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Saying No
&lt;/h3&gt;

&lt;p&gt;Migrating to a self-service portal instead of handling every request manually isn't always smooth. This is where DevOps needs to start saying "No."&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;"Can you please create me a MySQL user? It's urgent."&lt;br&gt;
"We're moving all MySQL users to the self-service platform. If we keep doing this manually, we'll never finish building it. You'll need to use the Buffet."&lt;/p&gt;

&lt;p&gt;This is usually the point where developers start feeling that DevOps is blocking them instead of helping.&lt;br&gt;
But in order to invest in innovation, you need to stop doing things manually, even if it means developers will temporarily have to wait.&lt;/p&gt;

&lt;p&gt;Short-term friction. Long-term acceleration.&lt;/p&gt;

&lt;p&gt;It's not always pleasant, but it's necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of Standardization
&lt;/h3&gt;

&lt;p&gt;That was just the tip of the iceberg. What came next hit us harder than expected.&lt;br&gt;
The root of almost all our headaches was lack of standardization.&lt;br&gt;
Each team did things their own way, which made organization-wide improvements painful. That included:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
Not all services followed organization or AWS best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;br&gt;
Teams wrote CI/CD pipelines differently, ran different pre-deploy checks, and deployed to AWS in their own ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;br&gt;
Alarms, custom metrics, and dashboards varied widely. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt;&lt;br&gt;
Access was inconsistent and sometimes overly permissive.&lt;/p&gt;

&lt;p&gt;And that was just the beginning.&lt;/p&gt;

&lt;p&gt;Every change felt risky, manual, and error-prone.&lt;/p&gt;

&lt;p&gt;The real kicker was that making an organization-wide change required touching every repository individually. This created friction, extra manual work, and a high risk of mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unicorn Startup Pressure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Doing all this while scaling a unicorn startup in a race for acquisition added even more pressure.&lt;/p&gt;

&lt;p&gt;Legacy services, tight deadlines, and a high-growth environment made the transition especially tricky. There was no clean slate. &lt;/p&gt;

&lt;p&gt;Everything had to keep working while we improved it.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Did We Do?
&lt;/h2&gt;

&lt;p&gt;We tackled the most impactful challenges first.&lt;br&gt;
Creating a self-service platform immediately eliminated a lot of manual work.&lt;/p&gt;

&lt;p&gt;But as good as the self-service platform was, it handled only one aspect of Platform Engineering. Our biggest challenge remained lack of standardization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Golden Path
&lt;/h2&gt;

&lt;p&gt;We started creating a Golden Path, our organization's right way of doing things.&lt;/p&gt;

&lt;p&gt;Our backend-platform team built the MDK (Melio Development Kit) - an internal opinionated CLI that generates and enforces AWS SAM service templates. Similar in spirit to CDK, it helps developers create a standardized template.yaml, which we use to deploy our services.&lt;br&gt;
It wasn't only about building templates faster, although writing SAM templates manually is never fun.&lt;/p&gt;

&lt;p&gt;More importantly, it finally allowed us to define how a service should look.&lt;/p&gt;

&lt;p&gt;Let that sink in for a moment. It's a big deal.&lt;/p&gt;

&lt;p&gt;With the MDK, we unlocked many opportunities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set best practices for AWS architecture, security, logging, tagging, and FinOps&lt;/li&gt;
&lt;li&gt;No more wide IAM permissions&lt;/li&gt;
&lt;li&gt;No more SQS without DLQs&lt;/li&gt;
&lt;li&gt;No more using highly expensive resources without a reason&lt;/li&gt;
&lt;/ul&gt;
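
&lt;p&gt;Checks like these can run at template-generation or CI time. Here's a small sketch of what such guardrails might look like over a parsed template; the rules shown are illustrative, not the MDK's real code:&lt;/p&gt;

```python
# Illustrative guardrails over a parsed SAM/CloudFormation template dict:
# flag SQS queues with no DLQ and IAM policies with wildcard actions.
def find_violations(template):
    """Return a list of guardrail violations found in a template dict."""
    violations = []
    for name, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        rtype = resource.get("Type")
        # No SQS queue without a dead-letter queue configured.
        if rtype == "AWS::SQS::Queue" and "RedrivePolicy" not in props:
            violations.append(f"{name}: SQS queue has no DLQ")
        # No wildcard IAM actions.
        if rtype == "AWS::IAM::Policy":
            doc = props.get("PolicyDocument", {})
            for stmt in doc.get("Statement", []):
                actions = stmt.get("Action", [])
                if isinstance(actions, str):
                    actions = [actions]
                if "*" in actions:
                    violations.append(f"{name}: wildcard IAM action")
    return violations
```

&lt;p&gt;Because every service is generated from the same templates, one checker like this covers the whole organization.&lt;/p&gt;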

&lt;p&gt;From the developer side, this also meant less guesswork and fewer decisions when starting a new service.&lt;/p&gt;

&lt;p&gt;And the best thing?&lt;/p&gt;

&lt;p&gt;Want to add a new feature across the organization? No more opening pull requests on hundreds of repositories, each one looking different.&lt;/p&gt;

&lt;p&gt;Just open one pull request.&lt;/p&gt;

&lt;p&gt;Or so we thought.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption Challenges
&lt;/h2&gt;

&lt;p&gt;Developers were comfortable with how they used to work and weren't excited about migrating their services to the MDK.&lt;/p&gt;

&lt;p&gt;The naive solution was setting this as a cross-organization initiative, prioritizing it with product managers, and working closely with R&amp;amp;D.&lt;/p&gt;

&lt;p&gt;In practice, enforcement became necessary.&lt;/p&gt;

&lt;p&gt;The harsh truth is that you can only get so far by asking nicely. To really move forward, you have to define guidelines and enforce them.&lt;/p&gt;

&lt;p&gt;Enforcement can happen at multiple levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure guardrails (for example, AWS Service Control Policies)&lt;/li&gt;
&lt;li&gt;Deployment blocking for non-compliant services&lt;/li&gt;
&lt;li&gt;Clear deprecation timelines for old versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: announce that developers have one month to start using the MDK. Other deployment methods will be deprecated and blocked. Teams that don't migrate won't be able to deploy new versions.&lt;/p&gt;

&lt;p&gt;It sounds aggressive, but this is often the only way to make progress at scale.&lt;/p&gt;

&lt;p&gt;The same applies to versions. If developers keep using an old version of the MDK, new features won't help. Deprecation and enforced upgrades are necessary.&lt;/p&gt;
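
&lt;p&gt;A deployment gate for deprecated versions can be as simple as a version check in CI. A sketch, with a hypothetical version scheme and cutoff:&lt;/p&gt;

```python
# Hypothetical CI gate: refuse to deploy services pinned to an MDK version
# older than the organization-wide minimum.
MINIMUM_MDK_VERSION = (2, 0, 0)

def parse_version(version):
    """Turn a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def may_deploy(service_mdk_version):
    """Block deployments from services on a deprecated MDK version."""
    return parse_version(service_mdk_version) >= MINIMUM_MDK_VERSION
```

&lt;p&gt;Pair a gate like this with a clearly announced deadline, and upgrades stop being optional without requiring anyone to chase teams manually.&lt;/p&gt;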

&lt;h2&gt;
  
  
  Where We Are Now
&lt;/h2&gt;

&lt;p&gt;Today, with both the MDK and the Buffet, which we continuously improve, we're on the right track. But there is still a long way to go.&lt;/p&gt;

&lt;p&gt;One of the clearest examples of why standardization matters is tagging.&lt;/p&gt;

&lt;p&gt;It may look insignificant, but tagging unlocks many capabilities. From ABAC-based permissions, which are critical for least privilege access, to cost allocation per team, and easily finding owners of budget-eating services, tagging is the foundation for everything.&lt;/p&gt;

&lt;p&gt;When moving to a Platform Engineering approach, we always need to operate on two paths in parallel.&lt;/p&gt;

&lt;p&gt;We must define guidelines and enforce them. For example, all new services must include predefined tags, and non-compliant deployments should be blocked (AWS Config can help here).&lt;/p&gt;
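
&lt;p&gt;A custom AWS Config rule essentially boils down to a compliance evaluation like the following sketch. The required tag keys here are examples, not our actual tagging policy:&lt;/p&gt;

```python
# Config-rule-style tag compliance check, sketched in plain Python.
# The required keys are illustrative examples.
REQUIRED_TAGS = {"team", "service", "env", "cost-center"}

def evaluate_tags(resource_tags):
    """Return a verdict plus any missing tag keys, like a custom Config rule."""
    missing = REQUIRED_TAGS - set(resource_tags)
    if missing:
        return "NON_COMPLIANT", sorted(missing)
    return "COMPLIANT", []
```

&lt;p&gt;The same evaluation can run in CI before deployment and in AWS Config after it, so drift gets caught on both paths.&lt;/p&gt;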

&lt;p&gt;At the same time, we must migrate existing services, which often takes much longer.&lt;/p&gt;

&lt;p&gt;The same principle applies to CI/CD, monitoring and observability, and yes, also AI.&lt;/p&gt;

&lt;p&gt;We live in a world with constant AI FOMO. Adopting everything immediately leads to fragmentation, which is the enemy of Platform Engineering.&lt;/p&gt;

&lt;p&gt;We need to choose the right tools, define guidelines, and invest in proper rollout with training sessions, tutorials, documentation, and even hackathons.&lt;/p&gt;

&lt;p&gt;Just like DevOps, Platform Engineering is a mindset and a methodology. It should be reflected everywhere.&lt;br&gt;
Security, AI, FinOps, CI/CD, monitoring. The same approach applies to all of them.&lt;/p&gt;

&lt;p&gt;Enable, don't block. But make the right way the easiest way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned and Conclusion
&lt;/h2&gt;

&lt;p&gt;Platform Engineering is never done. It's an ongoing journey.&lt;br&gt;
The most important rule is standardization. Define guidelines and enforce them. That's the key.&lt;/p&gt;

&lt;p&gt;Don't do the work for developers. Give them the tools to do it themselves.&lt;/p&gt;

&lt;p&gt;Always prioritize building self-service tools over manual, repetitive work, no matter how urgent it feels, unless it's truly P0.&lt;/p&gt;

&lt;p&gt;Remember that the self-service platform is only part of the story. As big as it is, it's not the whole picture.&lt;/p&gt;

&lt;p&gt;Start as early as you can. It will have a massive impact later, when you need to support many legacy services that don't follow organizational guidelines.&lt;/p&gt;

&lt;p&gt;The beginning will be hard, but it pays off. Platform Engineering boosts productivity, even if at first it slows you down.&lt;/p&gt;

&lt;p&gt;If DevOps scales people, Platform Engineering scales standards.&lt;/p&gt;

&lt;p&gt;And just as important, explain to developers why you're doing this and how it benefits them. You'll need their cooperation.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>platformengineering</category>
      <category>awscommunitybuilders</category>
      <category>fintech</category>
    </item>
    <item>
      <title>The Problem With Cross-Account DB Access (And How Data API Solved It)</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Fri, 09 Jan 2026 09:34:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-problem-with-cross-account-db-access-and-how-data-api-solved-it-4feh</link>
      <guid>https://dev.to/aws-builders/the-problem-with-cross-account-db-access-and-how-data-api-solved-it-4feh</guid>
      <description>&lt;h2&gt;
  
  
  So what is Data API and how can it help you?
&lt;/h2&gt;

&lt;p&gt;You know how it goes.&lt;br&gt;
You get a support ticket from a developer: “Please create a DB user for my service.”&lt;br&gt;
Or worse: “I need to query the DB, can you create me a personal user?”&lt;/p&gt;

&lt;p&gt;It’s not the most exciting task a DevOps engineer can get, but it’s critical for keeping the business running. And when this process does not scale, DevOps quickly becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;So we built a self-service mechanism. A developer can simply use a Slack bot to request a database user, without involving DevOps.&lt;br&gt;
Win for the developer, win for DevOps.&lt;/p&gt;

&lt;p&gt;Under the hood, the architecture was actually pretty solid.&lt;br&gt;
We stored the desired users list in an S3 bucket.&lt;br&gt;
Each change triggered an event notification to a central SNS topic, which kicked off a Step Function in every AWS account.&lt;br&gt;
From there, Lambda functions handled the heavy lifting: creating the DB user with the right permissions (read-only, read-write, admin), generating the secret, and even registering it in the database proxy.&lt;/p&gt;

&lt;p&gt;On paper, this was robust. In reality, distributed systems are never perfect.&lt;/p&gt;

&lt;p&gt;Sometimes an event was delayed. Sometimes a Lambda failed. Sometimes permissions drifted. And occasionally, a user was simply not created.&lt;/p&gt;

&lt;p&gt;The real problem was not the failure itself.&lt;br&gt;
It was visibility.&lt;/p&gt;

&lt;p&gt;From the developer’s perspective, they clicked a button in Slack and… nothing happened. No clear feedback, no way to validate the outcome.&lt;/p&gt;

&lt;p&gt;That is when we decided to build a &lt;strong&gt;Validator workflow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea sounded simple: create a Lambda function that checks whether the DB user exists and reports the result back to the user.&lt;/p&gt;

&lt;p&gt;But then reality hit again.&lt;/p&gt;

&lt;p&gt;We needed to validate users across &lt;strong&gt;multiple environments, multiple AWS accounts, and multiple databases&lt;/strong&gt;. And as we all know, connecting to a database usually means one thing: network access from inside the VPC.&lt;/p&gt;

&lt;p&gt;In a multi-account setup, that left us with two main options.&lt;/p&gt;

&lt;p&gt;The first option was VPC peering. Technically possible, but not something we were comfortable with. We did not want to expose production VPCs to other environments just for validation purposes.&lt;/p&gt;

&lt;p&gt;The second option was to deploy a Lambda function in every account and trigger it cross-account. That worked, but now we were managing dozens of Lambdas, IAM roles, permissions, and invocation logic. The validator itself was becoming another distributed system.&lt;/p&gt;

&lt;p&gt;At that point, we stepped back and asked a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if there was a way to query all of our databases from a central account, without VPC peering and without deploying Lambda functions everywhere?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Turns out, there is.&lt;br&gt;
And it is called &lt;strong&gt;Data API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7575y27aqlgvvm38z2b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7575y27aqlgvvm38z2b8.png" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data API is a feature of Amazon Aurora that allows you to interact with a database using API calls, without requiring direct network connectivity to the database VPC. No security groups, no subnets, no peering. Just IAM-authenticated API calls.&lt;/p&gt;

&lt;p&gt;This changes the game.&lt;/p&gt;

&lt;p&gt;Instead of running validation logic inside every VPC, we could run a single Lambda in a central account and query each database directly using Data API. Same code, same workflow, no networking complexity.&lt;/p&gt;

&lt;p&gt;Under the hood, Data API is also what powers the RDS Query Editor in the AWS console. When you run a query from the UI, you are already using it, whether you realize it or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-account access (the part that actually matters)&lt;/strong&gt;&lt;br&gt;
Because the validator runs in a central account, we still needed a secure way to access databases that live in other environments.&lt;/p&gt;

&lt;p&gt;Data API removes the networking requirement, &lt;strong&gt;but IAM is still enforced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, this meant creating a &lt;strong&gt;cross-account IAM role&lt;/strong&gt; in each environment. The central account is allowed to assume this role, and the role itself has permissions to call the Data API on the local Aurora cluster.&lt;/p&gt;

&lt;p&gt;We deployed this role to all environments using &lt;strong&gt;Terraform&lt;/strong&gt;, so every account followed the same trust policy and permission boundaries. No manual setup, no snowflakes.&lt;/p&gt;

&lt;p&gt;From the validator’s perspective, the flow is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume the role in the target account&lt;/li&gt;
&lt;li&gt;Call the Data API&lt;/li&gt;
&lt;li&gt;Run the validation query&lt;/li&gt;
&lt;li&gt;Return the result to the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data API solves the networking problem.&lt;br&gt;
Cross-account IAM roles solve the permissions problem.&lt;/p&gt;
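
&lt;p&gt;In boto3 terms, the flow looks roughly like this. The role session name and the SQL are illustrative, and the real validator differs in detail:&lt;/p&gt;

```python
# Sketch of the validator: assume a cross-account role, then query Aurora
# through the Data API with the temporary credentials.
def build_validation_sql():
    """MySQL-flavored check for whether a named DB user exists."""
    return "SELECT COUNT(*) FROM mysql.user WHERE user = :username"

def validate_db_user(role_arn, cluster_arn, secret_arn, username):
    import boto3  # imported lazily so the pure helper stays testable offline

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="db-user-validator"
    )["Credentials"]

    rds_data = boto3.client(
        "rds-data",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    response = rds_data.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        sql=build_validation_sql(),
        parameters=[{"name": "username", "value": {"stringValue": username}}],
    )
    # Data API returns rows as lists of typed fields.
    count = response["records"][0][0]["longValue"]
    return count > 0
```

&lt;p&gt;Note that the Data API call needs no VPC configuration at all; the function running this code only needs permission to assume the cross-account role.&lt;/p&gt;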

&lt;p&gt;Together, they let us centralize access without compromising security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Data API actually makes sense&lt;/strong&gt;&lt;br&gt;
Data API is not a silver bullet. It has throughput limits, different latency characteristics, and it only works with Aurora. You would not use it for high-volume application traffic.&lt;/p&gt;

&lt;p&gt;But for &lt;strong&gt;control-plane operations&lt;/strong&gt; like validation, auditing, administrative workflows, and platform automation, it is extremely powerful.&lt;/p&gt;

&lt;p&gt;In our case, Data API allowed us to reduce infrastructure complexity, standardize access across environments, and give developers fast, reliable feedback without pulling DevOps into every request.&lt;/p&gt;

&lt;p&gt;Sometimes the best solution is not adding more infrastructure, but removing it.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>rds</category>
      <category>aurora</category>
    </item>
    <item>
      <title>From Bare Metal to Serverless: How to Evolve Your Disaster Recovery Strategy</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Wed, 23 Jul 2025 12:13:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/from-bare-metal-to-serverless-how-to-evolve-your-disaster-recovery-strategy-1h17</link>
      <guid>https://dev.to/aws-builders/from-bare-metal-to-serverless-how-to-evolve-your-disaster-recovery-strategy-1h17</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Imagine this scenario:&lt;/strong&gt;&lt;br&gt;
You’re working at a successful, even profitable company, using the latest cutting-edge technologies out there, and feeling good. Things can’t get any better than this.&lt;/p&gt;

&lt;p&gt;But one day, you wake up in the morning to 10+ missed calls and dozens of messages that yell “Production is Down!!!”.&lt;/p&gt;

&lt;p&gt;You find out that a disaster has occurred (your data center caught fire, there was a regional power outage, you name it!), and your entire system is down.&lt;/p&gt;

&lt;p&gt;You are probably thinking by now, “Those are on-prem problems, I’m using AWS — I have nothing to worry about!”&lt;/p&gt;

&lt;p&gt;But what happens if a hacker encrypts your entire environment? Or maybe LLM-generated code accidentally deleted the Production database?&lt;br&gt;
Or, a lighter option: an entire AWS region is down?&lt;/p&gt;

&lt;p&gt;What can you do?&lt;/p&gt;

&lt;p&gt;That’s where Disaster Recovery comes into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Me&lt;/strong&gt;&lt;br&gt;
I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer with over 4 years of experience, including the past 3 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architecture, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;One thing you should know about Melio is that our entire architecture is fully serverless. We run a large-scale environment of Lambda functions, and naturally, Lambda has become our go-to solution for nearly every challenge we need to address.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Disaster Recovery?
&lt;/h2&gt;

&lt;p&gt;Disaster recovery (or DR for short) is what it sounds like: recovering from a disaster. There are a lot of use cases that fall into this category.&lt;/p&gt;

&lt;p&gt;The bottom line is that you need to define a workflow to get your application up and running when your main site is down.&lt;/p&gt;

&lt;p&gt;Your Disaster Recovery Plan (DRP) shouldn’t be a separate initiative, it needs to be fully integrated with your architecture and application logic.&lt;/p&gt;

&lt;p&gt;Before you start building your DRP, you need to decide on your desired recovery time objective (RTO) and recovery point objective (RPO).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTO and RPO&lt;/strong&gt;&lt;br&gt;
RTO and RPO are the main components when designing a DRP; together, they determine your DR strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let me explain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTO&lt;/strong&gt; is the maximum downtime you can tolerate. How long will your system be down?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO&lt;/strong&gt; is the amount of data loss you are willing to endure. What can you afford to lose?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lower RTO and RPO mean less downtime and less data loss, but also (probably) a more expensive DRP.&lt;/p&gt;

&lt;p&gt;(Many aspects can affect our decision to choose our desired RPO and RTO, like KPIs, SLAs, or our commitment to our clients and partners).&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
DRP with RTO of 5 hours and RPO of 15 minutes means that we will have a data loss of up to 15 minutes (for example, by taking a scheduled snapshot every 15 minutes), and it will take up to 5 hours to get our system up and running again.&lt;/p&gt;
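
&lt;p&gt;The arithmetic is simple enough to express directly. A tiny sketch, using the example numbers above:&lt;/p&gt;

```python
# With scheduled snapshots, worst-case data loss equals the snapshot interval:
# the disaster hits just before the next snapshot would have run.
def worst_case_rpo_minutes(snapshot_interval_minutes):
    return snapshot_interval_minutes

def within_objectives(recovery_minutes, data_loss_minutes,
                      rto_minutes=300, rpo_minutes=15):
    """Did a recovery meet the example objectives (RTO 5h, RPO 15m)?"""
    return rto_minutes >= recovery_minutes and rpo_minutes >= data_loss_minutes
```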

&lt;h2&gt;
  
  
  Rollback or cutover?
&lt;/h2&gt;

&lt;p&gt;One more thing that you need to consider when designing a DRP is Rollback or Cutover.&lt;/p&gt;

&lt;p&gt;Let’s say we’re designing a DRP for an entire regional outage on AWS, we’ve initiated a failover to our backup region, and what happens when our main region is back online?&lt;/p&gt;

&lt;p&gt;Should we go back to our main region, or stay in the new one?&lt;/p&gt;

&lt;p&gt;If we’re dealing with a hacker who encrypted our entire region, we may not have a main region to go back to.&lt;/p&gt;

&lt;p&gt;So it’s really important to ask ourselves those questions before defining our DRP; the answer to those questions will determine our strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does DR work in an on-prem situation?
&lt;/h2&gt;

&lt;p&gt;OK, now we know what a DR is a little better, but before we jump into DR on cloud, let’s get back to the basics.&lt;/p&gt;

&lt;p&gt;How does good, old-fashioned DR work on-prem?&lt;/p&gt;

&lt;p&gt;On AWS, we can spin up a new DR environment with just a few clicks (yes, I know I’m exaggerating).&lt;/p&gt;

&lt;p&gt;But on-prem, that’s a whole different story.&lt;/p&gt;

&lt;p&gt;We have to plan ahead and run our entire workload accordingly.&lt;/p&gt;

&lt;p&gt;What does this mean? Let’s start with a real-life example.&lt;/p&gt;

&lt;p&gt;A while ago, when I served in the Israel Police, we needed a DR for the Israeli 911 emergency center, and the cloud wasn’t an option.&lt;/p&gt;

&lt;p&gt;So we needed to build a new emergency center from scratch, in a different, physically isolated place, with all the required equipment (computers, phones, communication devices, you name it!). It may seem like a use case that has nothing to do with cloud DR, but the basic principles are the same when you come to design a DRP for the cloud.&lt;/p&gt;

&lt;p&gt;I wanted to understand how DR behaves on-prem, so I met with a Director of Storage Architecture to shed some light on it.&lt;/p&gt;

&lt;p&gt;When designing a DRP on-prem, we have two main methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first is to build the DR site within 300 meters of the main one, connected with FC (Fibre Channel) cables.&lt;/li&gt;
&lt;li&gt;The second supports distances of up to about 10 km, with a single cable running between the sites.
(Some solutions support extended distances beyond 10 km, but with higher complexity and various downsides; that’s out of scope for this blog post.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When talking about DR on-prem, we also need to choose:&lt;/strong&gt;&lt;br&gt;
Do we want a failover DR, which will be activated only when a disaster occurs?&lt;/p&gt;

&lt;p&gt;Or do we want to utilize our DRP to be fast and resilient, and run on an active-active architecture?&lt;/p&gt;

&lt;p&gt;Active-active means that we have both our main site and backup site working at the same time!&lt;/p&gt;

&lt;p&gt;When using active-active, we want each site to be able to absorb all the traffic routed from the failed site when a disaster occurs. That means each site must run at only 50% of its capacity at any given time, so that when needed it can handle 100% (the failed site’s traffic plus its own, all at once). In normal operation, that’s a huge amount of underutilized resources!&lt;/p&gt;

&lt;p&gt;We have some serious trade-offs here, but is the cloud really any better?&lt;/p&gt;

&lt;h2&gt;
  
  
  Different strategies and approaches for DR
&lt;/h2&gt;

&lt;p&gt;So, when talking about an AWS DRP, we have a few strategies:&lt;/p&gt;

&lt;p&gt;Backup and restore, pilot light, warm standby, and multi-site, ranging from the lowest cost with the poorest RTO to the most expensive with the minimal RTO.&lt;/p&gt;

&lt;p&gt;Let’s break them down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backup and restore:&lt;/strong&gt;&lt;br&gt;
This is the most basic one and pretty straightforward:&lt;/p&gt;

&lt;p&gt;We take snapshots of our RDS at a fixed interval (and of our EC2 or container images, depending on which compute services we are using) and save them in our backup region.&lt;/p&gt;

&lt;p&gt;Pros: It’s the simplest and cheapest one.&lt;/p&gt;

&lt;p&gt;Cons: When we face a disaster, we will need to deploy all of our services from scratch and restore our RDS from a snapshot, which will result in a longer downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot light:&lt;/strong&gt;&lt;br&gt;
Similar to backup and restore, but here we keep our core functionality up and running on our backup region, so when we need to initialize a failover to the backup region, it will be faster.&lt;/p&gt;

&lt;p&gt;Of course, as we said before, we get better RTO and the price goes up as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm standby:&lt;/strong&gt;&lt;br&gt;
Here we don’t only have our core functionality up and running on our backup region, but our entire scaled-down system is running on our backup region.&lt;/p&gt;

&lt;p&gt;So when a disaster occurs, we just need to scale up our backup environment instead of deploying it from scratch!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-site:&lt;/strong&gt;&lt;br&gt;
Here we have an active-active architecture, we have both our main region and the backup region running our fully scaled-up workload!&lt;br&gt;
This method requires a different approach, and is harder to maintain; think about it, now you have twice the production to give you a headache!&lt;/p&gt;

&lt;p&gt;But you get the ultimate RTO and RPO! You’re always live, and your users won’t be able to tell the difference if your main site is down — you will just need to have double the budget.&lt;/p&gt;
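
&lt;p&gt;Picking between these strategies is mostly a function of your RTO target (and budget). A rough decision helper, with purely illustrative thresholds:&lt;/p&gt;

```python
# Map a target RTO to one of the four AWS DR strategies described above.
# The cutoffs are illustrative; real ones depend on your workload and budget.
def choose_dr_strategy(rto_minutes):
    if rto_minutes >= 24 * 60:
        return "backup and restore"
    if rto_minutes >= 4 * 60:
        return "pilot light"
    if rto_minutes >= 30:
        return "warm standby"
    return "multi-site"
```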

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45ekayour8075944vd11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45ekayour8075944vd11.png" alt="DR Strategies" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How is a serverless DR different from traditional DR?
&lt;/h2&gt;

&lt;p&gt;So we learned about DR on-prem and on AWS, but what about serverless?&lt;/p&gt;

&lt;p&gt;Here’s where things get interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Trick: Pre-Deployment Without Paying for Idle&lt;/strong&gt;&lt;br&gt;
We can get the benefits of an active-active approach (like minimal RTO), but here’s the trick: with serverless, we don’t pay for most of our backup resources as long as we don’t use them! We can deploy our services ahead of time, making sure they’re ready to serve traffic immediately when needed, without paying for the time they sit idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keeping Environments in Sync with Stack Sets&lt;/strong&gt;&lt;br&gt;
We can keep both environments up to date by deploying to both of them regularly, at the same time, with CloudFormation StackSets, which let us deploy a CloudFormation stack to multiple regions and even multiple accounts!&lt;/p&gt;
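
&lt;p&gt;In boto3, adding stack instances to an existing StackSet looks roughly like this. The StackSet name, account ID, and regions are placeholders:&lt;/p&gt;

```python
# Sketch: deploy the same stack to a primary and a backup region through an
# existing CloudFormation StackSet.
def build_instance_request(stack_set_name, account_id, regions):
    """Build the kwargs for cloudformation.create_stack_instances."""
    return {
        "StackSetName": stack_set_name,
        "Accounts": [account_id],
        "Regions": regions,
    }

def deploy_to_both_regions(stack_set_name, account_id):
    import boto3  # imported lazily so the builder above is testable offline

    cfn = boto3.client("cloudformation")
    request = build_instance_request(
        stack_set_name, account_id, ["us-east-1", "us-west-2"]
    )
    return cfn.create_stack_instances(**request)["OperationId"]
```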

&lt;p&gt;Now all of our serverless components are deployed and ready for action, but we have many more resources to take care of, depending on how robust we want our solution to be.&lt;/p&gt;

&lt;p&gt;Let’s tackle them one by one, starting with the most important one — our database (DB)!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s Talk About Databases&lt;/strong&gt;&lt;br&gt;
Without a DB, we practically don’t have anything, so it’s one of the most crucial aspects (if not the most crucial) to pay attention to when designing a DR.&lt;/p&gt;

&lt;p&gt;As we’ve seen before, there are many different approaches we can take, depending on the trade-off we want between RTO and cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Availability DB Options&lt;/strong&gt;&lt;br&gt;
The options range from cross-region snapshots, through cross-region read replicas (which we promote to primary when a disaster occurs), all the way to Aurora Global Database (or DynamoDB global tables if you are using a NoSQL DB).&lt;/p&gt;

&lt;p&gt;Aurora Global Database ensures rapid recovery (an RTO of under 1 minute of downtime) with minimal data loss (an RPO of about 1 second), enabling robust business continuity.&lt;/p&gt;

&lt;p&gt;But despite all that, there is a potential loss of up to 1 second of writes because of the asynchronous replication. (If the DB itself is intact, such as when the entire region is down, the data will still be available on the original cluster as soon as it recovers.) So it’s important to keep this in mind.&lt;/p&gt;
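&lt;p&gt;A minimal boto3 sketch of wiring up an Aurora Global Database; the identifiers, engine, and regions below are placeholder assumptions:&lt;/p&gt;

```python
# Sketch: promoting an existing Aurora cluster into a Global Database and
# attaching a secondary cluster in the backup region. Names are placeholders.

def secondary_cluster_params(global_id, region):
    """Parameters for the read-only secondary cluster in the backup region."""
    return {
        "DBClusterIdentifier": global_id + "-" + region,
        "Engine": "aurora-postgresql",
        "GlobalClusterIdentifier": global_id,
    }

def create_global_db(primary_cluster_arn, global_id, dr_region):
    import boto3
    # promote the existing primary cluster into a global cluster...
    boto3.client("rds").create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=primary_cluster_arn)
    # ...then attach a secondary cluster in the backup region; Aurora handles
    # the asynchronous cross-region replication from here
    boto3.client("rds", region_name=dr_region).create_db_cluster(
        **secondary_cluster_params(global_id, dr_region))
```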

&lt;p&gt;Aurora Global DB sure is great! But even if you’re using it (or any other DB), it’s still extremely important to set up an immutable backup!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immutable Backups: Your Last Line of Defense&lt;/strong&gt;&lt;br&gt;
You can do so with AWS Backup, which supports immutability natively via Vault Lock.&lt;br&gt;
The purpose is to have a backup that no one can change or delete! So if, for example, a hacker gets into your system and encrypts or deletes all of your data, you will still have your immutable copy to recover from. (It’s recommended to store this copy in another region or even another account!)&lt;/p&gt;
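&lt;p&gt;A sketch of such an immutable vault using AWS Backup Vault Lock via boto3; the vault name and retention windows below are illustrative:&lt;/p&gt;

```python
# Sketch: an AWS Backup vault with Vault Lock enabled. Once ChangeableForDays
# elapses, the lock becomes immutable (compliance mode) and no one, not even
# the root user, can delete recovery points early. Values are placeholders.

def lock_config(vault_name, min_days=35, changeable_days=3):
    """Vault Lock settings for put_backup_vault_lock_configuration."""
    return {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_days,
        "ChangeableForDays": changeable_days,
    }

def create_locked_vault(vault_name, region="eu-west-1"):
    import boto3
    backup = boto3.client("backup", region_name=region)
    backup.create_backup_vault(BackupVaultName=vault_name)
    backup.put_backup_vault_lock_configuration(**lock_config(vault_name))
```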

&lt;p&gt;Ok, back to our Serverless DRP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Do We Know the Region Is Down?&lt;/strong&gt;&lt;br&gt;
How does our system even know that our main environment is down? We can’t use any regional service (like an ELB) for this purpose, since it will be down as well if our entire region is experiencing an outage; we have to use a global service: Route53.&lt;/p&gt;

&lt;p&gt;We can set a failover routing policy with health checks, enabling automatic failover to our backup region whenever our main environment becomes unavailable. This gives us a seamless failover mechanism (and can even trigger a CloudWatch alarm that kicks off any other actions we need to take when our main site is down).&lt;/p&gt;
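&lt;p&gt;The failover records can be sketched with boto3 roughly like this; the domain, IPs, and health check id are placeholders:&lt;/p&gt;

```python
# Sketch: a Route53 failover routing policy. The PRIMARY record answers while
# its health check passes; Route53 fails over to SECONDARY automatically.
# Domain, IPs, and the health check id are placeholders.

def failover_change_batch(domain, primary_ip, secondary_ip, health_check_id):
    """Build the ChangeBatch with a PRIMARY/SECONDARY failover record pair."""
    def record(role, ip, hc=None):
        r = {"Name": domain, "Type": "A", "SetIdentifier": role,
             "Failover": role, "TTL": 60,
             "ResourceRecords": [{"Value": ip}]}
        if hc:
            r["HealthCheckId"] = hc
        return r
    return {"Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": record("PRIMARY", primary_ip, health_check_id)},
        {"Action": "UPSERT",
         "ResourceRecordSet": record("SECONDARY", secondary_ip)},
    ]}

def apply_failover(zone_id, batch):
    import boto3
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch=batch)
```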

&lt;p&gt;&lt;strong&gt;Other Key Services: S3 and CloudFront&lt;/strong&gt;&lt;br&gt;
Ok, but what about other services, like S3 buckets or CloudFront?&lt;/p&gt;

&lt;p&gt;For an S3 bucket, it’s pretty simple: we can set up cross-region replication to our backup region, and all new objects will be replicated to the backup bucket!&lt;/p&gt;
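&lt;p&gt;A minimal sketch of that replication setup with boto3, assuming versioning is already enabled on both buckets and an IAM replication role exists (all names are placeholders):&lt;/p&gt;

```python
# Sketch: S3 cross-region replication to the backup-region bucket.
# Versioning must already be enabled on both buckets; names are placeholders.

def replication_config(role_arn, dest_bucket):
    """Replication rule sending all new objects to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "dr-replication",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # empty filter = replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::" + dest_bucket},
        }],
    }

def enable_replication(source_bucket, role_arn, dest_bucket):
    import boto3
    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket))
```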

&lt;p&gt;In a CloudFront distribution, we can configure an origin group with a failover origin, and whenever our main region becomes unavailable, CloudFront will automatically route requests to the failover origin.&lt;/p&gt;

&lt;p&gt;But serverless is not just about saving money; it’s about high availability too!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless = Built-in High Availability&lt;/strong&gt;&lt;br&gt;
When we use traditional compute services, like EC2, an instance is bound to a specific AZ, and even if just one AZ (and not the entire region) experiences an outage, our system can still go down.&lt;/p&gt;

&lt;p&gt;When using serverless, this is no longer a concern!&lt;br&gt;
Since we’re using Lambda functions together with managed services (like API Gateway integrated with SQS and SNS, a common combination in serverless architectures), we get Multi-AZ resilience natively!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve seen what a DR is and how it’s implemented on-prem, the different approaches to DR in the cloud, and finally, how DR is taken to a whole other level with serverless!&lt;/p&gt;

&lt;p&gt;DRs are now easier than ever to implement: they run automatically, and they’re cheaper too!&lt;/p&gt;

&lt;p&gt;DR is one of the most important aspects of your workflow. It’s like insurance: you can go years without anything happening, but as soon as something bad does happen, you don’t want to be caught without it.&lt;/p&gt;

&lt;p&gt;So whether you choose backup and restore, pilot light, warm standby, or multi-site, it doesn’t matter, as long as you make sure to implement one!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In the best case, a DR only adds to your monthly bill, and you won’t see it delivering value most of the time. In the worst case, it saves your company’s time, money, and reputation when a disaster occurs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Orel Bello&lt;/em&gt;&lt;br&gt;
DevOps Platform Engineer @ Melio | AWS Community Builder&lt;br&gt;
Passionate about scaling DevOps with simplicity and impact.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>platformegineering</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building Self Service… using Self Service?</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 01 Jun 2025 10:31:26 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-self-service-using-self-service-1jci</link>
      <guid>https://dev.to/aws-builders/building-self-service-using-self-service-1jci</guid>
      <description>&lt;p&gt;In Platform Engineering, our mission is clear:&lt;br&gt;
&lt;strong&gt;Build tools that help developers move fast and independently - without waiting for DevOps or becoming a bottleneck.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Melio, a team of just 5 DevOps supports over 200 developers. That scale leaves no room for manual work. Self-service isn’t a nice-to-have - it’s survival.&lt;/p&gt;

&lt;p&gt;Usually, building self-service starts by identifying repetitive requests and automating them. But what if we could take it one step further?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if we could automate the automation itself?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;One of the most common needs at Melio is creating &lt;strong&gt;Self-Service Runners&lt;/strong&gt; - little automations developers can trigger on demand.&lt;/p&gt;

&lt;p&gt;Each runner used to require a bunch of steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloning a GitHub template&lt;/li&gt;
&lt;li&gt;Customizing the SAM template&lt;/li&gt;
&lt;li&gt;Updating Lambda code to handle parameters and logic&lt;/li&gt;
&lt;li&gt;Creating a Slack modal for input&lt;/li&gt;
&lt;li&gt;Hooking it all into CircleCI and deploying&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a lot. Too much.&lt;br&gt;
So… we built a runner that builds runners.&lt;/p&gt;




&lt;p&gt;With just a few inputs - runner name, a JSON schema for inputs, and optional Terraform config - this tool does it all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spins up a GitHub repo from a template&lt;/li&gt;
&lt;li&gt;Opens a PR with all code changes&lt;/li&gt;
&lt;li&gt;Builds the Slack Modal automatically&lt;/li&gt;
&lt;li&gt;Wires it all to CI/CD and deploys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fift00nn964xv8j6wtv49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fift00nn964xv8j6wtv49.png" alt="Using the new runner" width="800" height="983"&gt;&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;Think of it like a buffet for infrastructure.&lt;br&gt;
Developers choose what they need, and automation serves it up instantly.&lt;/p&gt;

&lt;p&gt;And the best part?&lt;br&gt;
We use &lt;strong&gt;Bedrock&lt;/strong&gt; to inject parameters dynamically into Terraform files - no more writing custom logic for every use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0wflk2jqc6cutnj7lgz.png" alt="Before and After" width="800" height="533"&gt;
&lt;/h2&gt;

&lt;p&gt;We’re not just building tools.&lt;br&gt;
&lt;strong&gt;We’re building tools that build tools.&lt;/strong&gt;&lt;br&gt;
That’s what Platform Engineering looks like when AI becomes part of the stack.&lt;/p&gt;

&lt;p&gt;We didn’t reinvent the wheel. No complex systems, no massive overhead. Just practical automation that scales - fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Buffet: Behind the Scenes&lt;/strong&gt;&lt;br&gt;
We call our self-service portal The Buffet because it empowers developers to “serve themselves” — instantly and independently. Whether it’s provisioning AWS resources, spinning up a database, or managing secrets, developers just make a request and automation takes care of the rest.&lt;/p&gt;

&lt;p&gt;It's built around a simple but powerful backbone: &lt;strong&gt;Lambda, SNS, SQS, GitHub PRs, and Terraform via Env0.&lt;/strong&gt;&lt;br&gt;
It’s not flashy — but it works incredibly well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr284ro37r7yasxqb7tul.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr284ro37r7yasxqb7tul.webp" alt="Buffet Diagram" width="717" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since introducing The Buffet, we’ve offloaded hundreds of support tickets.&lt;br&gt;
DevOps interruptions are down, and developer autonomy is way up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
The result?&lt;br&gt;
With just a few clicks, we can spin up a fully functional self-service runner — production-ready and developer-friendly.&lt;/p&gt;

&lt;p&gt;It’s not just faster.&lt;br&gt;
It’s a complete shift in how we support scale.&lt;br&gt;
And it’s already boosting productivity across the board.&lt;/p&gt;

&lt;p&gt;Would love to hear how others are approaching self-service at scale. Feel free to comment or connect 🙌&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Orel Bello&lt;/em&gt;&lt;br&gt;
DevOps Platform Engineer @ Melio | AWS Community Builder&lt;br&gt;
Passionate about scaling DevOps with simplicity and impact.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>platformegineering</category>
      <category>aws</category>
      <category>ai</category>
    </item>
    <item>
      <title>Unlocking Efficiency Through Lambda-Powered Workflows</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 20 Apr 2025 12:28:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/unlocking-efficiency-through-lambda-powered-workflows-408a</link>
      <guid>https://dev.to/aws-builders/unlocking-efficiency-through-lambda-powered-workflows-408a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtn19iocnurz46ylwtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtn19iocnurz46ylwtu.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Struggling to balance support tickets and innovation? Discover how a small DevOps team leverages simple Lambda-powered workflows to empower 200+ developers and unlock massive efficiency.
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;How do you manage endless support tickets while still focusing on innovation?&lt;/p&gt;

&lt;p&gt;Not every task we handle is thrilling or exciting. Not every task is blog-post material. Sometimes, we deal with less glamorous missions, like saving money on CloudWatch log storage, offboarding a developer, securing access to a sensitive S3 bucket, disabling unused IAM roles, implementing a code freeze solution, and more.&lt;/p&gt;

&lt;p&gt;And it doesn’t stop there — sometimes, we even get support tickets that seem endless: granting missing IAM permissions, creating MongoDB or RDS clusters and users, setting up AWS Personal Accounts, creating ECR repositories, secrets, and so much more.&lt;/p&gt;

&lt;p&gt;How can just one DevOps, or even a few, manage all of this in addition to daily tasks?&lt;/p&gt;

&lt;p&gt;Here’s where it gets interesting: we’re a team of only 5 DevOps, responsible for over 200 developers.&lt;/p&gt;

&lt;p&gt;What if there were a simple way to automate these tasks or, even better, empower developers to handle them on their own?&lt;/p&gt;

&lt;p&gt;Well, what if I told you there &lt;em&gt;is&lt;/em&gt; a way? A simple way.&lt;/p&gt;

&lt;p&gt;If I were to give this blog post another title, it would be &lt;em&gt;The Power of Simplicity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You won’t find a complex architecture with thousands of lines of code here. Instead, you’ll see the most basic and straightforward solutions — the kind that are often the most effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;Before we dive in, let me introduce myself.&lt;/p&gt;

&lt;p&gt;I’m Orel Bello, an AWS Community Builder and a passionate DevOps Platform Engineer with over 3.5 years of experience, including the past 2.5 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architect, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;One thing you should know about Melio is that our entire architecture is fully serverless. We run a large-scale environment of Lambda functions, and naturally, Lambda has become our go-to solution for nearly every challenge we need to address.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Lambda Functions
&lt;/h3&gt;

&lt;p&gt;Let’s take a look at what Lambda functions are and how they can help us boost efficiency through automation.&lt;/p&gt;

&lt;p&gt;We’re all familiar with Lambda functions — the serverless compute service that lets you focus on writing code instead of managing servers.&lt;/p&gt;

&lt;p&gt;Lambda integrates natively with many AWS services, making it the perfect tool for automation.&lt;/p&gt;

&lt;p&gt;You can trigger Lambda functions on demand, or by a variety of AWS services like EventBridge, SNS, SQS, API Gateway, and many more.&lt;/p&gt;

&lt;p&gt;And the best part? You don’t need to be an expert developer to write automations. All you need is a solid understanding of basic Python and the legendary boto3 library.&lt;/p&gt;

&lt;p&gt;Boto3 is the AWS SDK for Python, built on the same engine (botocore) that powers the AWS CLI we all know and love. It lets you perform actions on AWS with ease.&lt;/p&gt;

&lt;p&gt;And here’s the kicker — it’s already included in Lambda, so no additional layer is required!&lt;/p&gt;

&lt;p&gt;So, what can you do with it?&lt;/p&gt;

&lt;p&gt;Basically — everything!&lt;/p&gt;

&lt;p&gt;Let me show you just how simple it can be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 1: Implementing a Code Freeze Solution
&lt;/h3&gt;

&lt;p&gt;Let’s talk about the Code Freeze.&lt;/p&gt;

&lt;p&gt;We always want our production environment to be stable and error-free. But there are certain critical periods, like when we’re presenting a live demo to partners, where we can’t afford the risk of a developer accidentally deploying to production and causing issues. During these times, we need to block all deployments based on a schedule automatically — and, most importantly, make it easy to enable or disable the block if a hotfix is needed in production.&lt;/p&gt;

&lt;p&gt;Here’s the simplest solution for this:&lt;/p&gt;

&lt;p&gt;Let’s break it down into three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;  — For scheduling, we can use EventBridge, which allows us to use CRON expressions to trigger our Lambda function at specific times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking&lt;/strong&gt; — Since all of our services are deployed through CloudFormation stacks, blocking all deployments is as simple as denying CloudFormation actions (such as CreateStack and UpdateStack). We can achieve this using SCPs (Service Control Policies).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt;  — This is the bridge between EventBridge and SCP.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short, we write a simple Lambda function that attaches the SCP policy and trigger it with EventBridge (and, of course, another Lambda function and EventBridge rule to disable the code freeze). It’s as easy as that!&lt;/p&gt;
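&lt;p&gt;Here is a minimal sketch of such a Lambda, assuming the deny-CloudFormation SCP already exists; the policy and target ids are hypothetical placeholders:&lt;/p&gt;

```python
# Sketch of a code-freeze Lambda: EventBridge invokes it on a schedule with a
# constant input like {"action": "freeze"}. POLICY_ID and TARGET_ID are
# hypothetical placeholders for the SCP and the OU (or account) to freeze.
POLICY_ID = "p-examplefreeze"   # the SCP that denies CreateStack/UpdateStack
TARGET_ID = "ou-root-example"   # the OU we want to freeze

def policy_change(event, policy_id, target_id):
    """Decide which Organizations call the event asks for."""
    method = "attach_policy" if event.get("action") == "freeze" else "detach_policy"
    return method, {"PolicyId": policy_id, "TargetId": target_id}

def handler(event, context):
    import boto3  # boto3 ships with the Lambda Python runtime
    org = boto3.client("organizations")
    method, kwargs = policy_change(event, POLICY_ID, TARGET_ID)
    getattr(org, method)(**kwargs)  # attach = freeze on, detach = freeze off
    return {"status": method}
```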

&lt;p&gt;Automating the code freeze mechanism not only helps safeguard stability but also simplifies the process and reduces the chances of human error during those critical times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 2: Developer Offboarding Automation
&lt;/h3&gt;

&lt;p&gt;Alright, that was simple, but what about offboarding a developer?&lt;/p&gt;

&lt;p&gt;At Melio, every developer has a Personal AWS Account and a Personal Atlas MongoDB cluster. When they leave the company, we need to delete these resources for two key reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; : We want to make sure no backdoors are left open.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Optimization&lt;/strong&gt; : Resources that are no longer in use should be terminated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don’t worry, it’s just as straightforward as before.&lt;/p&gt;

&lt;p&gt;The first step is to use EventBridge integrated with CloudTrail to capture the DisableUser event, which tells us a developer has left the company.&lt;/p&gt;

&lt;p&gt;Next, we need to clean up the AWS resources before closing the account.&lt;/p&gt;

&lt;p&gt;Why not just close the account right away? We deploy third-party resources, like the Twingate connector, when creating a personal AWS account. We’ll need to run a terraform destroy before closing the account to terminate those external resources.&lt;/p&gt;

&lt;p&gt;How do we do this?&lt;/p&gt;

&lt;p&gt;We simply send an API request (using the requests library, so we’ll need a Lambda layer for that) to Env0 (our Terraform platform). Once the destroy operation is complete (we can implement a simple wait mechanism with a step function), we close the AWS account with a basic Boto3 command. Afterward, we make an API call to MongoDB to delete the cluster, and that’s it.&lt;/p&gt;
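&lt;p&gt;As a rough sketch of the trigger and the final account-closing step (the CloudTrail event source depends on your identity setup, so treat it as a placeholder):&lt;/p&gt;

```python
# Hypothetical sketch of the offboarding trigger and the final cleanup step.
# The eventSource value below is an assumption; adjust it to whatever service
# actually emits DisableUser in your environment.

def disable_user_pattern(event_source):
    """EventBridge pattern matching the CloudTrail DisableUser call."""
    return {
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {"eventSource": [event_source], "eventName": ["DisableUser"]},
    }

def close_personal_account(account_id):
    # runs only after the Env0 terraform destroy has completed
    import boto3
    boto3.client("organizations").close_account(AccountId=account_id)
```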

&lt;p&gt;It’s an easy workflow, and aside from the additional Lambda layer for the Python requests library, everything else is native to AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 3: CloudWatch Logs Cost Optimization
&lt;/h3&gt;

&lt;p&gt;Let’s look at one more use case.&lt;/p&gt;

&lt;p&gt;At Melio, we store log groups in CloudWatch to meet compliance requirements. However, CloudWatch can be expensive, so we came up with a more cost-effective solution: exporting log groups to S3, which is a much cheaper storage option.&lt;/p&gt;

&lt;p&gt;The catch? There isn’t a native way to do this automatically, like with the lifecycle rule for S3 buckets, so we had to build our own solution.&lt;/p&gt;

&lt;p&gt;Let’s break it down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2Aiegz39AnVS0iO2XH" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2Aiegz39AnVS0iO2XH" width="720" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DynamoDB Table Creation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a DynamoDB table containing the names of all log groups. This table acts as a registry for managing the export process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Export Task Initialization:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve the last item from the DynamoDB table, initiating an export task for the corresponding log group. Subsequently, remove the item from the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Set Retention Policy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply a retention policy of 3 months to the log group that was exported successfully, ensuring that only relevant data is retained in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Task Status Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check if the DynamoDB table is empty. If it is, the export process is complete. If not, wait for 15 minutes and monitor the status of the ongoing export task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Task Completion Check:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the export task is marked as done, start the next export task. If not, wait for 15 minutes and recheck the status.&lt;/p&gt;
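&lt;p&gt;Steps 2 and 3 can be sketched with boto3 roughly as follows; the log group and bucket names are illustrative:&lt;/p&gt;

```python
import time

# Sketch: export the last ~3 months of a log group to S3, then set a 90-day
# retention policy on the group. Log group and bucket names are placeholders.

def export_task_params(log_group, bucket, days=90):
    """Arguments for logs.create_export_task covering the last `days` days."""
    now_ms = int(time.time() * 1000)
    return {
        "taskName": "export-" + log_group.strip("/").replace("/", "-"),
        "logGroupName": log_group,
        "fromTime": now_ms - days * 24 * 3600 * 1000,
        "to": now_ms,
        "destination": bucket,
        "destinationPrefix": log_group.strip("/"),
    }

def export_and_set_retention(log_group, bucket):
    import boto3
    logs = boto3.client("logs")
    logs.create_export_task(**export_task_params(log_group, bucket))
    # keep only ~3 months in CloudWatch once the export succeeds
    logs.put_retention_policy(logGroupName=log_group, retentionInDays=90)
```

&lt;p&gt;Note that CloudWatch Logs allows only one active export task per account per region, which is exactly why the DynamoDB registry and the 15-minute polling loop above are needed.&lt;/p&gt;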

&lt;p&gt;We created a systematic approach to ensure log groups are exported to S3, reducing costs while still meeting compliance. The process runs periodically — every three months — ensuring that only the necessary data stays in CloudWatch. This results in significant cost savings over time while still staying compliant with our requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Buffet: A Self-Service Solution
&lt;/h3&gt;

&lt;p&gt;While Lambda saves time through automation, how can we address on-demand developer requests without creating bottlenecks?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;The Buffet&lt;/strong&gt; comes in — a self-service portal powered by Lambda functions.&lt;/p&gt;

&lt;p&gt;The Buffet empowers developers to work more efficiently without waiting for DevOps, removing the bottleneck and allowing them to perform tasks independently. It’s all about making their lives easier and letting them do what they need to do, without any dependency on DevOps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2ACodNuGL8mSxeW9M4" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F720%2F0%2ACodNuGL8mSxeW9M4" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;We’ve set up an interface where developers can submit their requests (we use Slack, but you can use any tool you prefer).&lt;/p&gt;

&lt;p&gt;Once a request is made, it’s sent via API Gateway into our AWS account. From there, we trigger an SNS topic, which sends the request to multiple SQS queues — one for each runner (i.e., self-service action). The relevant Lambda function pulls from the SQS queue and performs the actions.&lt;/p&gt;
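&lt;p&gt;A minimal sketch of what a runner’s Lambda looks like on the receiving end, assuming SNS raw message delivery is disabled (so each SQS body wraps an SNS envelope); the event shape shown is an assumption:&lt;/p&gt;

```python
import json

# Sketch of a runner Lambda pulling requests from its SQS queue. Each SQS
# record body wraps an SNS envelope whose "Message" field holds the
# developer's request (assuming raw message delivery is disabled).

def parse_requests(sqs_event):
    """Unwrap SNS-over-SQS records into plain request dicts."""
    return [json.loads(json.loads(record["body"])["Message"])
            for record in sqs_event["Records"]]

def handler(event, context):
    for request in parse_requests(event):
        # each runner implements its own action here, e.g. appending a new
        # ECR repository to the Terraform repo
        print("handling request:", request)
```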

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F717%2F0%2AjYK2EXhTewlUsLzr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F717%2F0%2AjYK2EXhTewlUsLzr" width="717" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Details&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That covers the infrastructure, but what about the logic for the runners?&lt;/p&gt;

&lt;p&gt;It’s simpler than you might think.&lt;/p&gt;

&lt;p&gt;We’ve identified the most frequently requested tasks and automated them. These are often day-one operations, like creating AWS personal accounts, ECR repositories, Secrets, RDS clusters, MongoDB clusters, and more.&lt;/p&gt;

&lt;p&gt;What do all of these tasks have in common?&lt;br&gt;
They all create resources using Terraform. And since the Terraform code is stored in a Git repository, we just fetch the relevant file, append the new resource, open a pull request, and after the merge, Env0 applies the changes.&lt;/p&gt;
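&lt;p&gt;A sketch of that fetch-append-PR flow using PyGithub; the repo name, file path, and branch are illustrative assumptions, not the actual code:&lt;/p&gt;

```python
# Sketch: append a Terraform resource to an existing file and open a PR with
# PyGithub. Repo name, path, and branch below are placeholders.

def append_resource(tf_content, resource_block):
    """Append a new resource block to existing Terraform file content."""
    return tf_content.rstrip() + "\n\n" + resource_block.strip() + "\n"

def open_pr(token, snippet):
    from github import Github  # PyGithub
    repo = Github(token).get_repo("my-org/infra")  # hypothetical repo
    path, branch = "ecr/main.tf", "add-ecr-repo"
    base = repo.get_branch(repo.default_branch)
    # branch off the default branch, commit the appended file, open the PR
    repo.create_git_ref(ref="refs/heads/" + branch, sha=base.commit.sha)
    current = repo.get_contents(path, ref=repo.default_branch)
    repo.update_file(path, "Add ECR repo",
                     append_resource(current.decoded_content.decode(), snippet),
                     current.sha, branch=branch)
    repo.create_pull(title="Add ECR repo", body="Created by The Buffet",
                     head=branch, base=repo.default_branch)
```

&lt;p&gt;After the merge, the Terraform platform picks up the change and applies it, so the Lambda never needs to run Terraform itself.&lt;/p&gt;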

&lt;p&gt;This simple but powerful architecture allows us to automate the creation of resources and easily add new runners without hassle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda function → Modify the Terraform repo → Create a PR → Apply with Env0.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;p&gt;Using Buffet is a win-win for everyone.&lt;/p&gt;

&lt;p&gt;Developers no longer need to wait on DevOps for support requests and can focus solely on development, free from bottlenecks. Meanwhile, DevOps can shift focus to more impactful tasks instead of handling repetitive support.&lt;/p&gt;

&lt;p&gt;Creating a Self-Service portal can significantly ease the day-to-day load on DevOps and streamline workflows for everyone.&lt;/p&gt;

&lt;p&gt;It does require effort: while building new runners is simple, creating the portal itself will take some time.&lt;/p&gt;

&lt;p&gt;However, it can empower your team and skyrocket productivity. The impact can be so huge that it’s like adding a new DevOps engineer to your team to handle the heavy lifting!&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;From a simple code freeze mechanism to comprehensive workflows, Lambda functions empower DevOps engineers to streamline their processes. Whether it’s using EventBridge for triggers, Step Functions for orchestration, or Slack for user interfaces, these tools make balancing efficiency and simplicity feel effortless.&lt;/p&gt;

&lt;p&gt;Ready to simplify your workflows? Start small — automate just one task and watch the impact it has. With every step forward, you’ll uncover the incredible power of simplicity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel8qsr3xa28icmifn1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel8qsr3xa28icmifn1t.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>awscommunitybuilder</category>
      <category>lambda</category>
      <category>serverless</category>
      <category>aws</category>
    </item>
    <item>
      <title>The journey to your first Tech Role</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Sun, 20 Apr 2025 10:15:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-journey-to-your-first-tech-role-5hc9</link>
      <guid>https://dev.to/aws-builders/the-journey-to-your-first-tech-role-5hc9</guid>
      <description>&lt;p&gt;So you’ve just finished your bachelor’s degree. Now what? With so many different fields in the industry, how can you choose what type of role to pursue? Do you know all the roles that are out there? It might be natural to try and apply to every available position out there–just to get in. And while there isn’t a single correct answer, there are some crucial aspects you’ll need to pay attention to before you apply–and choose the right role for you.&lt;/p&gt;

&lt;p&gt;First, let’s begin with my own journey. I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer.&lt;br&gt;
My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architect, I specialize in building scalable, efficient, and cost-effective cloud solutions.&lt;/p&gt;

&lt;p&gt;I have a passion for assisting individuals in finding their next position, and through this blog post, I aspire to reach and help as many people as possible.&lt;/p&gt;

&lt;p&gt;Now, let’s dive in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvla27fqpzb4kasjg69x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvla27fqpzb4kasjg69x.png" alt="Illustration" width="626" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Self Confidence:
&lt;/h2&gt;

&lt;p&gt;The primary thing you need when job-seeking is self-confidence. One common sentence I often hear from individuals attempting to enter the High Tech industry is, “Why should anyone hire me? What do I have to offer?”&lt;/p&gt;

&lt;p&gt;Let HR decide if you’re a good fit for the job; don’t do it for them. Believe in yourself and take pride in your accomplishments; don’t underestimate their value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preferred Field:
&lt;/h2&gt;

&lt;p&gt;Now, it’s time to choose your preferred field. It’s okay if you don’t have one yet. Consider two essential aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What advantages do you have compared to others? Think about your unique experiences, such as self-projects, military service, or Udemy courses, which may not be traditionally defined as experience but are valuable nonetheless.&lt;/li&gt;
&lt;li&gt;Identify what you enjoy doing and what you excel at. If you can think of several fields of interest, that’s a good starting point. Make sure to research these fields thoroughly, and be ready with an answer when the interviewer asks why you want to be an X. If you don’t have a good answer, trust me, they will know.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resume:
&lt;/h2&gt;

&lt;p&gt;Create multiple versions of your CV for each position. For instance, if your technological stack includes C++, Java, Assembly, Python, and Android development, and you’re applying for a Data Scientist position, many of these skills might be irrelevant. It’s best to focus on Python and provide more detail about it, rather than adding unrelated programming languages to your resume.&lt;/p&gt;

&lt;p&gt;Apply to every related position, even if you meet only 30% of the requirements. Don’t hesitate just because you’re missing a few details; apply anyway. After the process, if they want you, you’ll be in a strong position to negotiate.&lt;/p&gt;

&lt;h2&gt;
  
  
  LinkedIn:
&lt;/h2&gt;

&lt;p&gt;One of the first challenges you will encounter will be getting an interview. So, ensure you have a proper LinkedIn profile. Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a good profile picture and a valid title (if you’re currently unemployed, you can use a future position title, like ‘Junior XX’).&lt;/li&gt;
&lt;li&gt;Connect with individuals related to your preferred fields, such as people working in companies where you’d like to work, senior programmers in your desired position, or HR professionals from various companies.&lt;/li&gt;
&lt;li&gt;Aim for at least 500 connections, but focus on making valid connections rather than adding everyone you come across.&lt;/li&gt;
&lt;li&gt;Provide details about your technical skills, technological stack, prior experience (even if it doesn’t directly relate to your preferred fields), education, courses, and certifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi4tvjs342fjek1y0bak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi4tvjs342fjek1y0bak.png" alt="Illustration" width="626" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interviews:
&lt;/h2&gt;

&lt;p&gt;Interviews can be quite intimidating at first, but with enough experience, you’ll discover that most of them follow a similar pattern, and you’ll be able to handle them almost effortlessly. Your elevator speech is crucial and should be a 60-second introduction where you persuade the interviewer to hire you. It should include a brief summary of your experience, knowledge, strengths, and what you’ll bring to the position.&lt;/p&gt;

&lt;p&gt;Prepare a project you’re proud of, whether it’s from previous roles, college, or personal time, and be ready to discuss it. Anticipate questions like ‘Why did you choose to implement it that way?’ You’ll be asked general knowledge questions, which you can practice on platforms like Glassdoor. Or, you might receive a situational question for which you’ll need to debug a problem or propose a solution.&lt;/p&gt;

&lt;p&gt;If you’re uncertain about an answer to a general knowledge question, it’s acceptable to admit it, but avoid making something up. For situational questions, it’s recommended to think out loud, saying something like “Hmm, let’s see” or “Let’s think together,” and then propose possible solutions.&lt;/p&gt;

&lt;p&gt;Gain experience through interviews. Each one will enhance your chances, knowledge, and self-confidence. Even if an interview goes poorly, view it as a learning experience to improve for the next time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve yourself:
&lt;/h2&gt;

&lt;p&gt;What can you do until you get your first position? In Israel, it may take 6 months up to a year and a half to find your first job [&lt;a href="https://www.globes.co.il/news/article.aspx?did=1001377385" rel="noopener noreferrer"&gt;Hebrew&lt;/a&gt;]. In the meantime, consider these steps to make yourself more marketable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certifications: In the IT and DevOps fields, certifications are common and can boost your resume. Look for certifications in your spare time, like AWS, to expand your knowledge and skills. Some certifications are valid for a lifetime, while others last for three years. There are different levels, such as Associate, Specialist, or Professional.&lt;/li&gt;
&lt;li&gt;Projects and hands-on experience: Work on projects and gain hands-on experience. Showcase your projects on GitHub to demonstrate your skills to potential employers.&lt;/li&gt;
&lt;li&gt;Networking: Attend meetups to learn and meet professionals from the industry. Networking can lead to new connections and opportunities in the future.&lt;/li&gt;
&lt;li&gt;Online learning: Utilize platforms like Udemy for affordable or free courses to gain relevant skills.&lt;/li&gt;
&lt;li&gt;Consider an entry-level position: Starting in an entry-level position related to your preferred field can be beneficial. It might not be your dream job, but the experience gained can set you on the right career path. For instance, if you aim to be a DevOps engineer, positions like Automation or IT (even a help-desk role) can be a stepping stone. However, if your goal is to become a Data scientist, starting as QA might not be so beneficial.&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Consider enrolling in a bootcamp, but be aware of the two kinds available:&lt;/u&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paid bootcamps:&lt;/strong&gt; These can be expensive, costing a few thousand dollars or even more (up to 20,000 NIS). They do not guarantee a job at the end of the course, but you can quit without an additional fee (although you won’t get a refund for the course fee).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company-backed bootcamps:&lt;/strong&gt; Some bootcamps are free and offer guaranteed job placement at their company or other partnering firms, but this is usually limited to exceptional students. If you join, be prepared to work at the company for 2–3 years, often at a lower salary compared to other companies. Quitting early may result in a significant penalty (up to 90,000 NIS).&lt;/li&gt;
&lt;li&gt;As a recommendation: bootcamps are optional, and their value varies by field of interest. They can be beneficial for those without prior experience or relevant education (like a bachelor’s degree). However, if you have the self-discipline to learn independently, it might be better to treat a bootcamp as a last resort.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Never give up
&lt;/h2&gt;

&lt;p&gt;Securing your first tech role may not always be easy. The key is to never give up: keep trying and put in effort every single day. Stay committed to the process, and success will come eventually. Best of luck!&lt;/p&gt;

</description>
      <category>career</category>
      <category>development</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Pay Less For Serverless: Practical Tips</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Mon, 21 Oct 2024 11:18:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/pay-less-for-serverless-practical-tips-jcg</link>
      <guid>https://dev.to/aws-builders/pay-less-for-serverless-practical-tips-jcg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0jcsj1g9wd37ij4uh4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0jcsj1g9wd37ij4uh4l.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all know the benefits of using serverless architecture, the concept is pretty simple: we pay AWS for managing the infrastructure for us so that we can focus solely on developing, instead of handling and maintaining the servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about the costs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a small environment with infrequent access, the serverless architecture can actually save you money — for example, when you don’t have traffic, the environment scales to zero and you don’t pay at all.&lt;/p&gt;

&lt;p&gt;But in a large environment, such as ours at Melio (where all of our architecture is serverless), the price can spike and reach over $100K monthly on the Lambda functions alone, so what can we do to optimize it?&lt;/p&gt;

&lt;p&gt;The first thing we need to do is to determine which services will be used in a serverless architecture, and then we can see how to optimize them.&lt;/p&gt;

&lt;p&gt;This blog post will explore the various strategies for cost optimization in a serverless architecture, focusing on services and best practices to ensure efficient spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who am I and why do I care about cloud costs?
&lt;/h3&gt;

&lt;p&gt;My name is Orel Bello, and for the last two years, I’ve been working as a DevOps Engineer at Melio. I’m an AWS Certified Solutions Architect Professional and Melio’s focal point for FinOps.&lt;/p&gt;

&lt;p&gt;Since I started using AWS, I have paid attention to the price of every resource, as pricing is a big part of the AWS Certified Solutions Architect Associate certification I earned at the beginning of my cloud journey, so I knew we had a lot to cut.&lt;/p&gt;

&lt;p&gt;Recently, Melio started the enrollment process for the AWS EDP (Enterprise Discount Program), which requires cost optimization beforehand, so let’s start saving money:&lt;/p&gt;

&lt;h4&gt;
  
  
  Lambda Pricing
&lt;/h4&gt;

&lt;p&gt;Before optimizing Lambda costs, it’s important to understand the pricing model.&lt;/p&gt;

&lt;p&gt;You are charged based on execution time (measured in milliseconds) and the amount of memory allocated.&lt;/p&gt;

&lt;p&gt;For example, a function with 128MB of memory (which costs $0.0000000021 per millisecond) and an execution time of 3 seconds would cost ($0.0000000021 * 3000 =) $0.0000063 per invocation.&lt;/p&gt;

&lt;p&gt;If you double the memory and halve the execution time, the cost will remain roughly the same. However, the performance improvement might vary depending on the task.&lt;/p&gt;

&lt;p&gt;Remember, each Lambda execution environment handles only one request at a time. Therefore, more requests lead to more invocations, which increases costs.&lt;/p&gt;
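&lt;p&gt;To make the math concrete, here’s a small sketch of that pricing formula. The 128MB per-millisecond rate is the one quoted above, and the helper name is illustrative; always check the current AWS pricing page for your region:&lt;/p&gt;

```python
def lambda_invocation_cost(duration_ms, memory_mb, price_per_ms_128mb=0.0000000021):
    """Cost of one invocation: the per-ms rate scales linearly with memory."""
    rate = price_per_ms_128mb * (memory_mb / 128)
    return duration_ms * rate

# 128MB running for 3 seconds, as in the example above:
cost = lambda_invocation_cost(3000, 128)   # 0.0000063
# Double the memory and halve the duration: roughly the same price
same = lambda_invocation_cost(1500, 256)
```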

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Introducing AWS Lambda Power Tuning:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So you just created a new Lambda function, how do you choose how much RAM you need to allocate? (While you can’t directly adjust vCPU values, increasing RAM indirectly enhances vCPU performance too.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/alexcasalboni/aws-lambda-power-tuning" rel="noopener noreferrer"&gt;This open-source tool&lt;/a&gt; can help you optimize your Lambda function and suggest the best power configuration to minimize cost and/or maximize performance.&lt;/p&gt;

&lt;p&gt;It will run your function on a benchmark, suggesting the best values for RAM, and will also show the average execution time.&lt;/p&gt;

&lt;p&gt;So by increasing RAM based on the results, you can make your Lambda function run faster, and you’ll pay less (or at least the same) because the execution time is reduced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ajv0wCmKdAuuYSZOy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ajv0wCmKdAuuYSZOy" width="1024" height="725"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Why not just set the timeout to the max value?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Setting the timeout to the maximum can be costly because you are charged for every millisecond your Lambda function runs. If an error occurs and the function simply waits for the timeout (for example, when you’re accessing an unresponsive API), you will incur unnecessary charges. Therefore, it’s crucial to set the timeout to fit your specific needs.&lt;/p&gt;

&lt;p&gt;To determine the correct timeout value for your Lambda function, you can use CloudWatch metrics or the Lambda Power tool. These tools provide the average execution time, allowing you to add a buffer for safety and set an appropriate timeout value.&lt;/p&gt;
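&lt;p&gt;As a rough sketch: once you’ve pulled recent execution times (for example, from the &lt;strong&gt;Duration&lt;/strong&gt; metric in CloudWatch), a sensible timeout is just the worst observed duration plus a safety buffer. The helper name and sample values below are illustrative:&lt;/p&gt;

```python
import math

def suggest_timeout_seconds(durations_ms, buffer=1.5):
    """Max observed duration times a safety buffer, rounded up to whole seconds."""
    return math.ceil(max(durations_ms) * buffer / 1000)

# Sample durations (ms), as you might read them from CloudWatch:
suggest_timeout_seconds([1200, 800, 950])   # 2 seconds, not the 900s maximum
```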

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEpSLjE3PoGWUI6tw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEpSLjE3PoGWUI6tw" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Don’t put all your code inside the Lambda handler&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lambda functions operate within a virtual environment that persists across invocations, known as a microVM. However, it’s crucial to note that the main function code (the handler) is executed fresh each time it’s called. If you set up resources like a database connection within the handler, they are recreated with every call, slowing performance and potentially increasing costs.&lt;/p&gt;

&lt;p&gt;To improve performance and cut expenses, it’s best practice to set up lasting resources, such as database connections, outside the handler. This enables subsequent invocations to reuse these established resources, leading to quicker execution and savings.&lt;/p&gt;
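&lt;p&gt;A minimal sketch of the pattern (here the standard library’s sqlite3 stands in for any real connection: an RDS client, Redis, a boto3 client, and so on):&lt;/p&gt;

```python
import sqlite3

# Created once per execution environment (cold start);
# warm invocations reuse the same connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS hits (n INTEGER)")

def handler(event, context):
    # Only per-request work belongs here: no connection setup.
    conn.execute("INSERT INTO hits (n) VALUES (1)")
    return conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
```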

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F498%2F0%2AyXrNiobl3F2LPCgy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F498%2F0%2AyXrNiobl3F2LPCgy" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Migrate to ARM-based AWS Graviton processor:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using ARM architecture with Graviton processors instead of x86 processors can reduce the overall cost of your Lambda function by up to 20% while improving performance by 19%!&lt;/p&gt;

&lt;p&gt;The migration itself is pretty simple: unless you have dependencies or libraries compiled for x86, you don’t need to take any further steps when migrating to the Graviton processor.&lt;/p&gt;

&lt;p&gt;Of course, it’s always best practice to run tests on Dev environments first before making changes on Production, but the transition itself should be pretty seamless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AN2g8tCGJvI8Sbsfr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AN2g8tCGJvI8Sbsfr" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Provisioned Concurrency — Don’t use it recklessly!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Provisioned Concurrency keeps your Lambda functions ‘warm’ and ready for action, making them execute faster by eliminating cold starts and improving performance.&lt;/p&gt;

&lt;p&gt;Keep in mind that you’re billed based on the number of provisioned concurrency units and the duration they’re active; used recklessly, it can become very expensive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F890%2F0%2AF677c9yaDu9j8CDA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F890%2F0%2AF677c9yaDu9j8CDA" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what do you need to do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Provisioned Concurrency only for production workloads with user-facing functions, and avoid using it in development environments.&lt;/li&gt;
&lt;li&gt;Provision the minimum amount of concurrency your function will need, by analyzing application traffic patterns and performance requirements (you can use the CloudWatch &lt;strong&gt;&lt;em&gt;ProvisionedConcurrencyUtilization&lt;/em&gt;&lt;/strong&gt; metric for that). Remember that over-provisioning will just cause extra costs.&lt;/li&gt;
&lt;li&gt;Use the auto-scaling feature of Provisioned Concurrency to gradually scale your function based on utilization, ensuring you avoid over-provisioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, functions with shorter execution times require less Provisioned Concurrency, so if you optimize your code and RAM configuration, and lower your execution time, you can also save money on the Provisioned Concurrency.&lt;/p&gt;
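&lt;p&gt;A quick back-of-the-envelope for sizing: by Little’s law, the concurrency a function needs is roughly requests per second times average duration in seconds, which is exactly why shorter execution times mean less Provisioned Concurrency to buy. A hypothetical helper:&lt;/p&gt;

```python
import math

def needed_concurrency(requests_per_second, avg_duration_ms):
    """Little's law estimate: concurrent executions = arrival rate * duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)

needed_concurrency(100, 250)   # 25 concurrent executions
needed_concurrency(100, 125)   # 13: halving duration nearly halves it
```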

&lt;p&gt;Remember: A serverless environment will not cost you money when there is no traffic, but you will pay for the provisioned concurrency! So even if you have an inactive environment, you must take it into account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Don’t Use Sleep:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Did you ever need to wait for an operation that is running outside the Lambda function to finish? Did you use ‘sleep’ while you wait?&lt;/p&gt;

&lt;p&gt;For those of you who aren’t familiar with the sleep method, it’s pretty straightforward: you specify the amount of time you want the function to pause while it waits for the external operation to finish.&lt;/p&gt;

&lt;p&gt;So why is it bad practice to use it inside a Lambda function?&lt;br&gt;&lt;br&gt;
As you may already guess, it’s because we pay for the time that the Lambda function is waiting for.&lt;/p&gt;

&lt;p&gt;So what can we do instead of using sleep?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Introducing Step Functions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step Functions is a serverless orchestration service that integrates natively with Lambda and many other services, and lets you model a workflow as a state machine.&lt;/p&gt;

&lt;p&gt;This lets us divide a large Lambda function that needs to wait for an I/O operation into smaller functions, with the wait-and-check logic between them living outside our Lambda functions, so we don’t pay for a function while it’s waiting!&lt;/p&gt;

&lt;p&gt;So if the wait is free on Step Functions, what is the pricing?&lt;/p&gt;

&lt;p&gt;We pay per state transition.&lt;br&gt;&lt;br&gt;
Let’s take a look at a common use case:&lt;/p&gt;

&lt;p&gt;We triggered an operation from the Lambda function, and set a loop to check when it’s done, with a ‘WAIT’ between each check.&lt;/p&gt;

&lt;p&gt;If we want to save costs, we can define the waiting time with a greater value, which will lower the number of transitions and reduce the overall cost.&lt;/p&gt;

&lt;p&gt;For a small state machine this is pretty insignificant, but at a large scale it can get expensive.&lt;/p&gt;
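&lt;p&gt;The polling pattern above can be sketched as an Amazon States Language definition, shown here as a Python dict (the state names and Lambda ARNs are placeholders). The Wait state itself is free, so raising its Seconds value reduces the number of paid transitions:&lt;/p&gt;

```python
import json

definition = {
    "StartAt": "StartJob",
    "States": {
        "StartJob": {"Type": "Task",
                     "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:start-job",
                     "Next": "WaitForJob"},
        # The Wait state costs nothing while it waits; a larger Seconds
        # value means fewer loop iterations, hence fewer paid transitions.
        "WaitForJob": {"Type": "Wait", "Seconds": 60, "Next": "CheckStatus"},
        "CheckStatus": {"Type": "Task",
                        "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:check-status",
                        "Next": "IsDone"},
        "IsDone": {"Type": "Choice",
                   "Choices": [{"Variable": "$.status",
                                "StringEquals": "DONE",
                                "Next": "Done"}],
                   "Default": "WaitForJob"},
        "Done": {"Type": "Succeed"},
    },
}

# This is what you would pass to Step Functions as the state machine definition:
state_machine_json = json.dumps(definition)
```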

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F791%2F0%2Au9M3-tJWX7BCoGQI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F791%2F0%2Au9M3-tJWX7BCoGQI" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Compute Saving Plan:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So what is the AWS Compute Savings Plan?&lt;/p&gt;

&lt;p&gt;You basically commit to using AWS Lambda for the next 1–3 years, and in exchange, you get a discount of up to 17% (the Compute Savings Plan also applies to EC2 and Fargate, where the discount can reach 66%).&lt;/p&gt;

&lt;p&gt;The pricing model of a Savings Plan is more flexible than RIs (Reserved Instances), as you aren’t bound to use a specific instance type or a specific region.&lt;/p&gt;

&lt;p&gt;If you’re afraid of the commitment, you can always choose the most basic option: 1 year with no upfront payment. If you’re working at a steady pace with solid Lambda usage, using Savings Plans should be a no-brainer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiS_9vpDUf_VymyLr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiS_9vpDUf_VymyLr" width="1024" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Logs — storing is cheap, writing is EXPENSIVE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logs are crucial, we just can’t live without them.&lt;/p&gt;

&lt;p&gt;But, do we really need all of our logs? There are a few log levels, such as DEBUG, INFO, WARN, ERROR, and FATAL, listed from most verbose to most severe.&lt;/p&gt;

&lt;p&gt;Do we really need to write them at such a high frequency? Is every INFO message really needed?&lt;/p&gt;

&lt;p&gt;Also, if we’re using a third-party monitoring tool, which itself costs a lot, do we really need to write the logs to CloudWatch as well?&lt;/p&gt;

&lt;p&gt;We need to understand that nothing is free and writing logs costs money, and with some work, we can save a lot of money!&lt;/p&gt;

&lt;p&gt;So what can you do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure that only crucial logs are written. (You can do so by utilizing Monolog’s &lt;a href="https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/FingersCrossedHandler.php" rel="noopener noreferrer"&gt;FingersCrossedHandler&lt;/a&gt;, which buffers logs and writes them only when errors occur.)&lt;/li&gt;
&lt;li&gt;Add a retention policy to delete the logs once they’re no longer needed (or archive them in S3 Glacier).&lt;/li&gt;
&lt;li&gt;When applicable, consider using the new &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-log-class-for-infrequent-access-logs-at-a-reduced-price/" rel="noopener noreferrer"&gt;Infrequent Access tier on CloudWatch&lt;/a&gt;, which can save you up to 50% on log group costs. (Note that it doesn’t fit every use case, as it doesn’t support real-time monitoring, metric filters, subscription filters, or log anomaly detection.)&lt;/li&gt;
&lt;/ul&gt;
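&lt;p&gt;For the retention point, here’s a small sketch of finding the log groups that still keep logs forever. The dicts mirror the shape CloudWatch Logs’ describe_log_groups returns, where &lt;em&gt;retentionInDays&lt;/em&gt; is simply absent when no retention is set; the group names are made up:&lt;/p&gt;

```python
def groups_missing_retention(log_groups):
    """Names of log groups with no retention policy (kept forever)."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]

# Entries as returned by describe_log_groups (hypothetical names):
groups = [
    {"logGroupName": "/aws/lambda/orders"},                          # kept forever
    {"logGroupName": "/aws/lambda/payments", "retentionInDays": 90},
]
groups_missing_retention(groups)   # ["/aws/lambda/orders"]
```

You could then apply a policy to each of those with the put_retention_policy API call.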

&lt;p&gt;&lt;strong&gt;10. VPC Endpoints (VPCE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This awesome feature is not unique to serverless architectures, but it’s a must-have!&lt;/p&gt;

&lt;p&gt;Basically, instead of leaving your VPC to reach AWS services via the NAT Gateway, which isn’t cheap, you can use AWS’s backbone network to connect to those services directly from your VPC, without traversing the public internet.&lt;/p&gt;

&lt;p&gt;This solution is more secure, efficient, and cost-effective, and you can use it with different services, such as S3, DynamoDB, ECR, EC2, Lambda, KMS, SSM and so on.&lt;/p&gt;

&lt;p&gt;This simple yet powerful feature can reduce your data processing costs and save you some money.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F768%2F0%2A50Gy7yY4loR_KI_N" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F768%2F0%2A50Gy7yY4loR_KI_N" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Cost optimization in a serverless environment is (almost) all about the Lambda functions.&lt;/p&gt;

&lt;p&gt;There is no doubt that this kind of cost optimization requires more effort, from both DevOps and the developers, and there isn’t much low-hanging fruit. But once you define guidelines in your organization and enforce them, you will be able to save a lot of money.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fank0vsexuuggy8hzqona.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fank0vsexuuggy8hzqona.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>melioengineering</category>
      <category>aws</category>
      <category>lambda</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How did we reduce our monthly AWS bills by 20% without breaking a sweat?</title>
      <dc:creator>Orel Bello</dc:creator>
      <pubDate>Wed, 05 Jun 2024 12:49:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-did-we-reduce-our-monthly-aws-bills-by-20-without-breaking-a-sweat-1kjm</link>
      <guid>https://dev.to/aws-builders/how-did-we-reduce-our-monthly-aws-bills-by-20-without-breaking-a-sweat-1kjm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed2h195sl3ig1yuyuwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ed2h195sl3ig1yuyuwk.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of my many tasks as a DevOps engineer in Melio was to reduce our cloud cost.&lt;/p&gt;

&lt;p&gt;Ok…it wasn’t my task, but I made it mine.&lt;/p&gt;

&lt;p&gt;I saw the enormous price we paid every month and I just couldn’t stand by; I wanted to do something about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who am I and why do I care about cloud costs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My name is Orel Bello, and for the last year, I’ve been working as a DevOps Engineer on the SRE (Site Reliability Engineering) team at Melio. I started as a Deputy Commander in the Technological Control Center of the Israel Police as part of my military service. I then completed my B.Sc. in Computer Science and started working as a Storage and Virtualization Engineer. After a year and a half, I realized that I wanted to be a DevOps engineer, and I got my first DevOps position right before I started at Melio.&lt;/p&gt;

&lt;p&gt;Since I started using AWS, I have paid attention to the price of every resource, as pricing is a big part of the AWS Solutions Architect Associate certification I earned at the beginning of my cloud journey, so I knew we had a lot to cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what’s the challenge?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we faced a lot of challenging and more urgent tasks in our day-to-day work, reducing our cloud cost wasn’t a priority. I had to find a way to do it with minimum effort and without the help of the R&amp;amp;D.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffveccnlffsir7dy3y8py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffveccnlffsir7dy3y8py.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started…&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I started to dig into our bills, and I saw many different metrics, but I didn’t know what they meant.&lt;/p&gt;

&lt;p&gt;One thing caught my eye: CloudWatch’s cost was high (about $20,000 monthly).&lt;/p&gt;

&lt;p&gt;After a little research, I discovered that we don’t have a retention policy for our log groups, so we keep them forever.&lt;/p&gt;

&lt;p&gt;I wanted to set a lifecycle policy (similar to the one S3 has natively) to set the retention to 3 months and then export the log groups to an S3 bucket for archiving, since it’s a much cheaper storage solution. However, I was amazed to see that there was no built-in automated option for it, so I had to build one of my own (using Step Functions and Lambdas; it was really fun to build).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Melio, we store log groups in CloudWatch to meet compliance requirements. However, due to the high costs associated with CloudWatch, we devised a cost-effective solution: exporting log groups to a more economical storage option — S3 buckets.&lt;/p&gt;

&lt;p&gt;We implemented a custom solution to automate this export process using AWS Step Functions triggered by an event bus. Here’s a breakdown of the process, which occurs every three months:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB Table Creation:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a DynamoDB table containing the names of all log groups. This table acts as a registry for managing the export process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Export Task Initialization:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve the last item from the DynamoDB table, initiating an export task for the corresponding log group. Subsequently, remove the item from the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Set Retention Policy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply a retention policy of 3 months to the log group that was exported successfully, ensuring that only relevant data is retained in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Task Status Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check if the DynamoDB table is empty. If it is, the export process is complete. If not, wait for 15 minutes and monitor the status of the ongoing export task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Task Completion Check:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the export task is marked as done, start the next export task. If not, wait for 15 minutes and recheck the status.&lt;/p&gt;

&lt;p&gt;This systematic approach ensures that log groups are exported to S3, reducing costs while adhering to compliance requirements. The periodic execution every three months guarantees that only necessary data remains in CloudWatch, contributing to significant cost savings over time.&lt;/p&gt;
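&lt;p&gt;Steps 2–5 boil down to a simple drain loop. Here’s a minimal, testable sketch with the AWS calls injected as plain functions; in production they would wrap CloudWatch Logs’ create_export_task and describe_export_tasks, the registry would be the DynamoDB table, and the waiting would be a Step Functions Wait state rather than an in-process loop:&lt;/p&gt;

```python
def drain_exports(pending, start_export, get_status):
    """pending: list of log group names (stand-in for the DynamoDB registry)."""
    finished = []
    while pending:                       # step 4: an empty registry means done
        group = pending.pop()            # step 2: take the last item
        task_id = start_export(group)    # step 2: begin the export task
        while get_status(task_id) != "COMPLETED":
            pass                         # steps 4-5: really a 15-minute wait
        finished.append(group)           # step 3: retention gets applied here
    return finished

# With fake AWS calls that complete immediately:
drain_exports(["groupA", "groupB"], lambda g: g, lambda t: "COMPLETED")
# ["groupB", "groupA"]
```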

&lt;p&gt;After a month or two, I noticed the costs were decreasing less than anticipated. In addition to our custom solution effectively managing data retention and export, diving into CloudWatch metrics revealed another key expense: Ingested data cost.&lt;/p&gt;

&lt;p&gt;While this solution remains beneficial for those with substantial expenses on CloudWatch log groups, I felt the need to delve deeper and explore additional avenues for savings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2As3EAWfcJ1fUpAxdI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2As3EAWfcJ1fUpAxdI" width="1024" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudwatch: the big money lies in writing, not in storing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I dove deep into our billing metrics and saw that the price of Ingested-Data (writing to the log groups) makes up most of our CloudWatch cost, while our Stored-Bytes (the storage of the log groups) was pretty low, so I had to change tactics.&lt;/p&gt;

&lt;p&gt;I found out that we have three log groups that produce so many logs that each one costs more than $1,500 monthly! Luckily, these log groups are pretty common, so you may be able to benefit from this too.&lt;/p&gt;

&lt;p&gt;The first one was VPC Flow Logs (which record all the traffic entering and leaving the VPC, useful for security and debugging purposes), which we simply modified to write logs to an S3 bucket instead of CloudWatch (if you don’t need them, you can just disable them). Doing that saved us &lt;strong&gt;$1,500&lt;/strong&gt; monthly!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudTrail, when not properly configured, is REALLY expensive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, there was the CloudTrail log group. CloudTrail is a useful (and expensive) AWS service that records every API action performed inside the AWS account.&lt;/p&gt;

&lt;p&gt;We had two separate CloudTrail log groups that we simply disabled and deleted (we didn’t even need them, since the same events were already saved in S3 and visible in the CloudTrail console).&lt;/p&gt;

&lt;p&gt;And just like that, we saved &lt;strong&gt;another $4,000&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;After I saw how expensive the CloudTrail log groups were, I decided to take another look at them. I found out that we had a duplicate trail, so we were paying extra; I just didn’t know how much extra. Disabling the additional trail saved &lt;strong&gt;$27,000&lt;/strong&gt; per month! We went from paying &lt;strong&gt;$30,000&lt;/strong&gt; monthly to only &lt;strong&gt;$3,000&lt;/strong&gt; monthly.&lt;/p&gt;
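&lt;p&gt;A simple way to spot this kind of duplication is to list your trails and flag any extra multi-region ones. The sketch below mimics the shape of CloudTrail’s DescribeTrails output, but the detection rule is a simplified assumption of mine, not an official AWS check:&lt;/p&gt;

```python
# Sketch: flag CloudTrail trails that look redundant because more than one
# multi-region trail is recording the same management events. The dicts mimic
# fields returned by CloudTrail's DescribeTrails API; the "first one wins"
# rule is a simplifying assumption - always review before deleting anything.

def find_redundant_trails(trails):
    multi_region = [t for t in trails if t.get("IsMultiRegionTrail")]
    # One multi-region management-event trail is usually enough;
    # everything beyond the first is a candidate for removal.
    return [t["Name"] for t in multi_region[1:]]

trails = [
    {"Name": "org-trail", "IsMultiRegionTrail": True},
    {"Name": "legacy-trail", "IsMultiRegionTrail": True},  # duplicate coverage
    {"Name": "s3-data-events", "IsMultiRegionTrail": False},
]
print(find_redundant_trails(trails))  # candidates to review, not auto-delete
```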

&lt;p&gt;&lt;strong&gt;RIs and Savings Plans — the first steps toward cost optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common and simple ways to save costs is by purchasing Reserved Instances (RIs) and Savings Plans.&lt;/p&gt;

&lt;p&gt;RIs and Savings Plans are similar, but with some key differences:&lt;/p&gt;

&lt;p&gt;RIs are tied to a specific instance type in a specific region, so if you switch to a different region or instance class mid-year, you will still be paying for the RIs you bought and are no longer using. Savings Plans, on the other hand, give you the flexibility to switch between instance families, sizes, and operating systems within the same region. Both require a commitment of 1–3 years.&lt;/p&gt;

&lt;p&gt;We already had a Compute Savings Plan, which saved around &lt;strong&gt;$8,000&lt;/strong&gt; per month (it applies to EC2, ECS, and Lambda; our architecture is mostly serverless, so it fit our needs well). I then purchased RIs for RDS (Relational Database Service) with the most basic plan (a 1-year commitment with no upfront cost, so there is no reason not to use it!), which saved us another &lt;strong&gt;$10,500&lt;/strong&gt; per month.&lt;/p&gt;
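&lt;p&gt;The arithmetic behind such a purchase is simple. The ~30% discount below is a hypothetical figure; actual RI rates vary by engine, instance class, region, and commitment terms:&lt;/p&gt;

```python
# Rough savings estimate for a 1-year, no-upfront RDS RI. The 30% discount
# is a hypothetical assumption - check the RDS pricing page for the real
# rate on your engine, instance class, and region before committing.

def ri_monthly_savings(on_demand_monthly, discount=0.30):
    """Estimated monthly saving from covering an on-demand bill with RIs."""
    return on_demand_monthly * discount

# e.g. a hypothetical $35,000/month on-demand RDS bill at a 30% discount:
print(f"${ri_monthly_savings(35000):,.0f} saved per month")
```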

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvic8w04a54gmsp0l4vx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvic8w04a54gmsp0l4vx9.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep an eye out for unknown bills — you might be surprised&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, I saw an odd bill for a new service called Security Lake. It was costly (around &lt;strong&gt;$10,000&lt;/strong&gt; per month), so I checked with the relevant team. The service didn’t provide enough value to justify its price tag, so we disabled it and saved another &lt;strong&gt;$10,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the first phase of reducing our cloud costs. The rest of the savings won’t be as easy to achieve, but will be worth it!&lt;/p&gt;

&lt;p&gt;Remember that cost optimization is all about monitoring. You should check each month that you don’t see unfamiliar bills or anomalies, and work constantly to reduce extra costs.&lt;/p&gt;
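&lt;p&gt;That monthly check can be as simple as comparing each service’s cost to the previous month and flagging big jumps. A minimal sketch, with hard-coded numbers standing in for data you would normally pull from Cost Explorer:&lt;/p&gt;

```python
# Minimal sketch of the monthly check described above: compare each service's
# cost to the previous month and flag jumps above a threshold. Real setups
# would pull these figures from Cost Explorer; here they are hard-coded.

def flag_anomalies(prev, curr, threshold=0.25):
    """Return services whose cost grew more than `threshold` month-over-month."""
    flagged = []
    for service, cost in curr.items():
        baseline = prev.get(service, 0.0)
        # A brand-new service (no baseline) is always worth a look.
        if baseline == 0.0 or (cost - baseline) / baseline > threshold:
            flagged.append(service)
    return sorted(flagged)

prev = {"CloudWatch": 3000, "RDS": 24000}
curr = {"CloudWatch": 3100, "RDS": 24500, "SecurityLake": 10000}  # new bill!
print(flag_anomalies(prev, curr))
```

This is exactly how the Security Lake bill above would have surfaced: a service with no prior baseline suddenly appearing in the monthly numbers.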

&lt;p&gt;First, you need to pinpoint your most expensive services, prioritizing quality over quantity. It’s important to choose your battles wisely: you can’t optimize all of your costs (OK, you can, but some of them are not worth the trouble, so make sure to focus on the most impactful ones).&lt;/p&gt;

&lt;p&gt;It’s very satisfying to help make a difference with so little effort. I encourage you to try it yourself. Saving money for your organization can impact its growth, and you can take some of the credit for it. 🙂&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvde7tigwoprqeh7brfka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvde7tigwoprqeh7brfka.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Visit our career website&lt;/em&gt;&lt;/p&gt;




</description>
      <category>costoptimization</category>
      <category>finops</category>
      <category>cloudwatch</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
