DEV Community: CareerByteCode

Zero-Downtime AKS Node Patching

infantus godfrey — Sun, 04 Jan 2026 19:28:00 +0000

Introduction

Patching AKS node VMs sounds routine until you have a hundred of them backing production traffic. This article shares a real-world approach to patching AKS nodes safely, what went wrong, and the Azure-native practices that actually worked.
It started as a “simple” task: security patches were overdue, compliance was asking questions, and we had an AKS cluster backing a critical workload.

Then someone said the number out loud.

“We have just over 100 node VMs in this cluster.”

That’s when the confidence dropped.

If you’ve ever patched a handful of VMs, you know the drill. But patching 100 nodes in an AKS cluster, without breaking workloads, triggering mass pod evictions, or waking up on-call engineers at 2 a.m., is a very different game.

This article walks through how we approached patching at scale on AKS, what worked, what didn’t, and the Azure best practices I wish we had followed from day one.

The Backstory: Why This Matters

AKS abstracts away a lot of infrastructure pain, until it doesn’t.

Under the hood, every AKS node is still a VM (or VMSS instance) that:

Needs OS security updates
Can reboot unexpectedly
Hosts multiple critical pods

In our case:

Multiple node pools
Mixed workloads (stateless + semi-stateful)
Strict SLOs
A hard compliance deadline

Manual patching was not an option. Blind automation was even worse.

The Core Idea: Let Kubernetes and Azure Do Their Jobs

The biggest mental shift was this:

We are not patching VMs. We are rotating nodes.

Instead of logging into machines or forcing updates, we leaned on:

AKS-managed upgrades
Node pool rotation
Proper pod disruption budgets
Controlled draining and surge capacity

If Kubernetes is given enough signals and room, it will protect your workloads.

Implementation: How We Patched 100 Nodes Safely

1. Split and Size Node Pools Intentionally

Large, single node pools are fragile during maintenance.

We:

Reduced blast radius by splitting workloads across pools
Ensured critical workloads had dedicated pools
Verified autoscaler limits before touching anything

Rule of thumb: If draining one node hurts, your node pool is too dense.

2. Set Pod Disruption Budgets (Seriously)

This was non-negotiable.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: api

Without PDBs:

Drains become chaos
Critical pods get evicted together

With PDBs:

Kubernetes pushes back
Drains slow down instead of breaking things
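
To see how a percentage-based PDB pushes back during a drain, here is a minimal sketch of the arithmetic. The helper names and replica counts are hypothetical, not taken from our cluster; the one fact it relies on is that Kubernetes rounds a percentage minAvailable up to the next whole pod.

```typescript
// Hypothetical helpers for illustration; they mirror the PDB arithmetic, not the eviction API.

// Kubernetes rounds a percentage minAvailable up to the next whole pod.
function minAvailablePods(replicas: number, minAvailable: number | string): number {
  if (typeof minAvailable === "number") return minAvailable;
  const percent = Number(minAvailable.replace("%", ""));
  return Math.ceil((replicas * percent) / 100);
}

// How many healthy pods a drain may evict right now without violating the budget.
function allowedDisruptions(
  healthyPods: number,
  replicas: number,
  minAvailable: number | string
): number {
  return Math.max(0, healthyPods - minAvailablePods(replicas, minAvailable));
}

console.log(allowedDisruptions(10, 10, "80%")); // 2
console.log(allowedDisruptions(4, 4, "80%")); // 0
```

The second call is the one to watch for: with only 4 replicas, 80% rounds up to 4, so the budget allows no voluntary evictions and a drain waits until the workload scales out or the PDB is loosened.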

3. Enable Surge Upgrades on Node Pools

Surge Upgrade Flow (Why This Prevents Outages)

This is why surge upgrades are so powerful:

Capacity goes up before it goes down
Kubernetes has room to breathe
PDBs can actually do their job

This was the unsung hero, and the single biggest factor in keeping production stable.

By enabling max surge on node pools:

New nodes came up before old ones drained
Capacity stayed stable
Rollouts were predictable

az aks nodepool update \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name nodepool1 \
  --max-surge 20%

Yes, it costs more temporarily. It’s worth it.
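
For a rough feel of what a max surge setting costs and buys, the sketch below estimates the extra nodes and the number of replacement batches. It assumes AKS applies a percentage to the current node count and rounds up to a whole node; the function names and pool sizes are made up for illustration.

```typescript
// Hypothetical helpers; the rounding behavior is an assumption stated in the text above.

// Extra nodes added during an upgrade for a max-surge value such as "20%" or "3".
function surgeNodes(nodeCount: number, maxSurge: string): number {
  if (maxSurge.endsWith("%")) {
    const percent = Number(maxSurge.slice(0, -1));
    return Math.ceil((nodeCount * percent) / 100);
  }
  return Number(maxSurge);
}

// Rough number of batches if each batch replaces as many nodes as were surged.
function upgradeBatches(nodeCount: number, maxSurge: string): number {
  const surge = Math.max(1, surgeNodes(nodeCount, maxSurge));
  return Math.ceil(nodeCount / surge);
}

console.log(surgeNodes(25, "20%")); // 5
console.log(upgradeBatches(25, "20%")); // 5
```

A higher surge means fewer batches and a shorter rollout, at the price of more temporary VMs and more pods moving at once.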

4. Use AKS Managed Node Image Upgrades

Instead of patching in-place, we:

Triggered node image upgrades
Let AKS cycle nodes gradually
Monitored pod rescheduling in real time

This aligned perfectly with Azure’s support model and saved us from custom scripts.

5. Drain With Observability, Not Hope

Every drain was monitored:

Pod restart counts
API error rates
Queue depths
Customer-facing latency

If metrics spiked, we paused.

Automation is useless without a big red stop button.
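
A stop button can be as simple as a check that runs between drains. The sketch below is one possible shape for it, not the tooling we actually ran; the metric names and thresholds are illustrative assumptions.

```typescript
// Hypothetical metric snapshot; field names and limits are illustrative assumptions.
interface DrainMetrics {
  podRestarts: number; // restarts observed since the drain started
  apiErrorRate: number; // fraction of failed requests, 0 to 1
  queueDepth: number; // messages waiting to be processed
  p95LatencyMs: number; // customer-facing latency
}

// Returns the names of every metric that is above its limit.
function breachedLimits(current: DrainMetrics, limits: DrainMetrics): string[] {
  return (Object.keys(limits) as (keyof DrainMetrics)[]).filter(
    (key) => current[key] > limits[key]
  );
}

// The "big red stop button": pause the rollout if anything is over its limit.
function shouldPause(current: DrainMetrics, limits: DrainMetrics): boolean {
  return breachedLimits(current, limits).length > 0;
}

const limits: DrainMetrics = { podRestarts: 5, apiErrorRate: 0.01, queueDepth: 1000, p95LatencyMs: 400 };
const spiking: DrainMetrics = { podRestarts: 1, apiErrorRate: 0.05, queueDepth: 120, p95LatencyMs: 650 };

console.log(shouldPause(spiking, limits)); // true
console.log(breachedLimits(spiking, limits)); // [ 'apiErrorRate', 'p95LatencyMs' ]
```

Returning the list of breached metrics, not just a boolean, makes it obvious to whoever is on call why the rollout stopped.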

What Went Wrong (Lessons Learned)

We still made mistakes.

One node pool had no PDBs (legacy workload)
Autoscaler limits were too tight
A stateful pod pretended to be stateless

The result?

Longer drain times
One near-incident
A lot of humility

But nothing went down, and that’s the bar.

Best Practices We’d Follow Again

Treat node patching as capacity management, not maintenance
Always over-provision before you drain
Test node rotation in non-prod regularly
Keep node pools smaller and purpose-driven
Document rollback paths

Common Pitfalls to Avoid

SSHing into AKS nodes to patch manually
Running giant node pools “for simplicity”
Ignoring PDB warnings
Patching during peak traffic
Assuming stateless means safe

Community Discussion

I’m curious:

How do you handle node patching at scale?
Do you rely fully on AKS upgrades or custom pipelines?
Any horror stories or success stories?

Drop them in the comments. We all learn from scars.

FAQ

Do I need to patch AKS nodes manually?

No. Azure recommends using managed node image upgrades or node pool rotation.

Can this be zero-downtime?

Yes, if your workloads are designed for disruption.

What about stateful workloads?

They need extra care: dedicated pools, stronger PDBs, and slower rollouts.

Final Thoughts

Patching 100 VM nodes isn’t impressive.

Doing it without your users noticing is.

AKS gives you the tools, but only if you respect how Kubernetes wants to work. Give it signals, time, and capacity, and it will repay you with boring, predictable maintenance.

And boring is exactly what production needs.

Choosing Yourself Without Guilt: A Lesson I Learned the Hard Way as a Developer

Zainab — Sun, 28 Dec 2025 19:55:31 +0000

Introduction

I used to think good developers said yes to everything.

Yes to late-night deploys.
Yes to “quick” fixes that weren’t quick.
Yes to helping everyone else, even when my own work was falling apart.

Saying no felt irresponsible. Choosing myself felt selfish.

It took burnout, missed deadlines, and a quiet loss of motivation to realize something uncomfortable:
I was optimizing for everyone except myself.

The Backstory (Why This Matters)
Early in my career, I believed effort was the main currency in tech.
If I worked harder:

I’d learn faster

I’d be respected more

I’d eventually feel confident

So I overcommitted. Constantly.

Extra tickets. Extra context switching. Extra emotional labor.
From the outside, it looked like growth.
From the inside, it felt like slowly draining a battery that never fully recharged.
The worst part?
I felt guilty even thinking about stepping back.

The Core Idea

Choosing yourself isn’t about doing less work.
It’s about doing sustainable work.
In engineering, we instinctively understand this:

We don’t run servers at 100% CPU forever

We add rate limits to protect systems

We design for failure, not perfection

But when it comes to ourselves?
We ignore every principle we apply to production systems.

Implementation: What “Choosing Yourself” Looked Like in Practice
This wasn’t a dramatic career pivot.
It was a series of small, uncomfortable changes.

Setting Boundaries Like You Set API Contracts

I started treating my time like an interface.

Clear expectations

Explicit limits

No hidden side effects

Instead of:

“Sure, I can handle that too.”

I said:

“I can help, but not today. I’m at capacity.”

It felt awkward. Nothing broke.
The system adjusted.

Reducing Context Switching (On Purpose)

I noticed I was:

Helping multiple teams

Juggling unrelated tasks

Never finishing deep work

So I limited my “open threads.”
Just like limiting concurrent requests, my focus improved almost immediately.

Stopping the Hero Mentality

I didn’t need to be the person who always saved the day. Being indispensable is a fragile architecture. I documented more. Delegated more. Trusted others more.

The team didn’t collapse. It got healthier.

What Went Wrong (Lessons Learned)
I waited too long.
By the time I acknowledged burnout:

My motivation was gone

Learning felt heavy

Even “easy” tasks felt exhausting

I learned that guilt is often a lagging indicator, like logs you only check after an outage.
If you wait until things break, recovery is slower.

Best Practices (Developer Edition)

Treat your energy like a limited resource

Add “timeouts” to work that drains you

Review your commitments like technical debt

Optimize for long-term throughput, not short-term output

Sustainable developers write better code.
Burned-out ones just write more of it.

Common Pitfalls

Confusing availability with value

Thinking rest must be “earned”

Believing saying no makes you replaceable

Waiting for permission to protect your time

None of these scale.

Community Discussion

I’m curious:

What’s the hardest boundary you’ve had to set as a developer?

Have you ever confused burnout with “just needing to work harder”?

What helped you choose yourself without regret?

Drop your experience in the comments. This is one of those topics we don’t talk about enough.

FAQ
Is choosing yourself bad for your career?
No. Chronic burnout is far worse for your career than healthy boundaries.

What if my team expects constant availability?
That’s a system problem, not a personal failure. Systems can be redesigned.

Does this apply to junior developers?
Especially to juniors. Learning is faster when you’re rested and focused.

Final Thoughts

Choosing yourself doesn’t mean you care less about your team.
It means you care enough to show up whole—not exhausted, resentful, or running on fumes.
In tech, we design systems to last.

It’s okay to do the same for yourself.

Building Secure Cloud Infrastructure -> How AI-Powered IaC Development Revolutionizes Security

Vijesh Nair — Sat, 27 Dec 2025 19:05:15 +0000

In today's rapidly evolving cloud landscape, organizations are increasingly adopting Infrastructure as Code (IaC) to manage their cloud resources efficiently. However, with great power comes great responsibility, and that responsibility extends to ensuring our infrastructure is secure by design.

As Infracodebase specializes in creating secure, enterprise-grade infrastructure using advanced AI capabilities, we've seen firsthand how the right approach to IaC can transform an organization's security posture. This article explores the essential security considerations and best practices when building infrastructure using modern IaC tools, regardless of which cloud provider you choose, and how Infracodebase's AI-assisted development can enhance every aspect of this process.

🛡️ The Foundation of Secure Infrastructure

When building infrastructure programmatically, security isn't an afterthought; it's a fundamental design principle that must be woven into every layer of your architecture. Modern IaC tools like Terraform, Pulumi, and CloudFormation give us unprecedented control over our cloud resources, but they also require us to think carefully about security implications from day one.

This is where Infracodebase's expertise in AI-powered infrastructure development becomes invaluable. Infracodebase works with cutting-edge tools across all major cloud platforms (AWS, Azure, Google Cloud) and can generate secure, production-ready infrastructure code in multiple languages - from Terraform HCL to Pulumi in Python, TypeScript, or Go, to native CloudFormation templates. What sets Infracodebase apart is the ability to automatically implement security best practices while explaining every decision, ensuring both security and knowledge transfer.

Core Security Principles in IaC

🔐 Principle of Least Privilege: Every resource, service, and user should have the minimum permissions necessary to perform their function. This means carefully crafting IAM policies, service principals, and access controls that grant only what's needed, when it's needed.

🛡️ Defense in Depth: Rather than relying on a single security measure, we implement multiple layers of protection. This includes network segmentation, encryption at rest and in transit, proper authentication mechanisms, and comprehensive monitoring.

🚫 Zero Trust Architecture: We assume that no network location is inherently trustworthy. Every request, whether from inside or outside our network perimeter, must be authenticated and authorized before accessing resources.

🌐 Network Security: The First Line of Defense

Network security forms the backbone of any secure infrastructure. When designing network architectures through IaC, several critical considerations come into play:

Virtual Network Isolation

Proper network segmentation starts with creating isolated virtual networks (VNets in Azure, VPCs in AWS and Google Cloud). These provide the foundation for controlling traffic flow and implementing security boundaries. Within these networks, we further segment using subnets to isolate different tiers of our application: web servers, application servers, and databases should each reside in their own subnet with carefully controlled access rules.

Network Access Controls

Network Security Groups (NSGs), Security Groups, and firewall rules act as virtual firewalls, controlling inbound and outbound traffic at the subnet and instance level. The key is implementing a "deny by default" approach, where we explicitly allow only the traffic patterns that are necessary for our applications to function.
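
As a rough illustration of "deny by default", the sketch below models the idea with a deliberately simplified rule shape. The subnet names, ports, and the `AllowRule` type are hypothetical; real NSG and security group rules carry more fields (priority, protocol, direction).

```typescript
// Hypothetical, simplified rule model for illustration only.
interface AllowRule {
  fromSubnet: string;
  toSubnet: string;
  port: number;
}

// Only the traffic patterns the application needs are listed.
const allowRules: AllowRule[] = [
  { fromSubnet: "web", toSubnet: "app", port: 8080 },
  { fromSubnet: "app", toSubnet: "db", port: 5432 },
];

// Deny by default: traffic passes only if some rule explicitly allows it.
function isAllowed(
  from: string,
  to: string,
  port: number,
  rules: AllowRule[] = allowRules
): boolean {
  return rules.some(
    (rule) => rule.fromSubnet === from && rule.toSubnet === to && rule.port === port
  );
}

console.log(isAllowed("web", "app", 8080)); // true
console.log(isAllowed("web", "db", 5432)); // false
```

The web tier cannot reach the database directly because no rule says it can, which is the whole point: an omission results in a denial, not an exposure.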

In practice, Infracodebase automatically generates these security rules based on application requirements, ensuring that each service gets exactly the network access it needs – nothing more, nothing less. Infracodebase can also create visual architecture diagrams that clearly show security boundaries and data flow, making it easy for teams to understand and audit their security posture.

Private Endpoints and Service Integration

Modern cloud platforms offer private endpoints that allow services to communicate over the cloud provider's backbone network rather than the public internet. This significantly reduces the attack surface by keeping sensitive traffic off public networks.

👤 Identity and Access Management: The Guardian of Resources

IAM is perhaps the most critical aspect of cloud security. A misconfigured IAM policy can expose sensitive resources or grant excessive permissions that could be exploited.

Service Principal Management

When services need to authenticate with each other or access cloud resources, we use service principals or managed identities rather than embedding credentials in code. This approach ensures that authentication tokens are managed by the cloud platform and can be rotated automatically.

Infracodebase's approach to identity management goes beyond just creating service principals – we design comprehensive identity architectures that leverage the latest cloud-native identity services. Whether it's Azure Managed Identity, AWS IAM Roles for Service Accounts, or Google Cloud Service Accounts, Infracodebase ensures that your applications can authenticate securely without ever storing credentials in code or configuration files.

Role-Based Access Control

Implementing proper RBAC ensures that users and services can only access resources they need for their specific roles. This involves creating custom roles when built-in roles are too broad, and regularly reviewing and auditing access patterns.

Multi-Factor Authentication

For human users, MFA adds an essential additional layer of security. When designing infrastructure, we ensure that all administrative access requires MFA and that this requirement is enforced at the platform level.

🔒 Data Protection: Safeguarding Information Assets

Data is often the most valuable asset in any organization, making its protection paramount.

Encryption Strategies

Data should be encrypted both at rest and in transit. For data at rest, we leverage cloud-native encryption services that handle key management transparently. For data in transit, we ensure all communications use TLS 1.2 or higher and implement certificate validation.

Key Management

Proper key management involves using cloud-native key vaults or hardware security modules (HSMs) to store encryption keys, secrets, and certificates. These services provide secure storage, automatic rotation capabilities, and detailed audit logging.

Data Classification and Handling

Different types of data require different levels of protection. Personal information, financial data, and trade secrets each have specific regulatory and business requirements that must be reflected in our infrastructure design.

📊 Monitoring and Compliance: Maintaining Visibility

Security isn't just about prevention – it's also about detection and response.

Comprehensive Logging

Every component of our infrastructure should generate logs that capture security-relevant events. This includes authentication attempts, configuration changes, data access patterns, and network traffic flows. These logs must be stored securely and retained for appropriate periods.

Real-time Monitoring

Security monitoring tools analyze log data in real-time to detect anomalous behavior that might indicate a security incident. This includes unusual login patterns, unexpected configuration changes, or abnormal network traffic.

Compliance Frameworks

Many organizations must comply with regulations like GDPR, HIPAA, SOC 2, or industry-specific standards. Our infrastructure design must incorporate controls that support these compliance requirements, including data residency, audit trails, and access controls.

💻 Secure Development Practices for IaC

The way we develop and deploy infrastructure code has significant security implications.

Code Security Scanning

IaC code should be scanned for security vulnerabilities before deployment. This includes checking for hardcoded credentials, overly permissive policies, and configurations that don't follow security best practices.
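
To make the hardcoded-credential check concrete, here is a minimal sketch of a line-by-line scan. The two patterns are illustrative assumptions and far from a complete rule set; dedicated scanners cover many more cases, and this is not how any particular tool is implemented.

```typescript
// Illustrative patterns only; a real scanner ships a much larger, tuned rule set.
const secretPatterns: { name: string; pattern: RegExp }[] = [
  { name: "AWS access key ID", pattern: /AKIA[0-9A-Z]{16}/ },
  { name: "hardcoded password", pattern: /password\s*=\s*"[^"]+"/i },
];

interface Finding {
  line: number; // 1-based line number in the scanned text
  rule: string; // name of the pattern that matched
}

// Checks every line of an IaC file against each pattern.
function scanForSecrets(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { name, pattern } of secretPatterns) {
      if (pattern.test(text)) findings.push({ line: index + 1, rule: name });
    }
  });
  return findings;
}

const sample = [
  'resource "aws_db_instance" "main" {',
  '  username = "app"',
  '  password = "hunter2"',
  "}",
].join("\n");

console.log(scanForSecrets(sample)); // [ { line: 3, rule: 'hardcoded password' } ]
```

A value that is referenced, such as `password = var.db_password`, does not match, which is the behavior you want from a check that runs before every merge.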

One of Infracodebase's key advantages is that it generates secure code from the ground up. Every piece of infrastructure Infracodebase creates follows security best practices by default – no hardcoded secrets, properly scoped permissions, encrypted storage, and secure network configurations. Infracodebase also integrates seamlessly with security scanning tools and can automatically remediate common security issues before they reach your repositories.

Version Control and Change Management

All infrastructure changes should go through a controlled process that includes peer review, automated testing, and staged deployments. This ensures that security considerations are evaluated before changes reach production.

Secret Management

Credentials, API keys, and other sensitive values must never be hardcoded in IaC templates. Instead, they should be stored in secure vault services and referenced dynamically during deployment.

☁️ Cloud-Agnostic Security Considerations

While each cloud provider has unique services and security models, certain principles apply universally:

Shared Responsibility Model

Understanding the shared responsibility model is crucial. Cloud providers secure the infrastructure, but customers are responsible for securing their data, applications, and configurations. This responsibility varies depending on the service model (IaaS, PaaS, SaaS).

Cross-Cloud Consistency

Organizations using multiple cloud providers need consistent security policies and controls across platforms. This requires abstracting security requirements from specific cloud implementations and ensuring that equivalent protections exist in each environment.

Vendor Lock-in Considerations

While cloud-native security services often provide the best protection, organizations must balance security with the risk of vendor lock-in. Sometimes, third-party security tools that work across multiple clouds provide better long-term flexibility.

🔗 Integration Security: Protecting the Ecosystem

Modern infrastructure rarely operates in isolation – it integrates with various external services, APIs, and management platforms.

API Security

When infrastructure components communicate through APIs, proper authentication and authorization mechanisms must be in place. This includes using appropriate authentication methods (OAuth 2.0, API keys, mutual TLS), implementing rate limiting, and validating all input data.

Third-Party Integrations

External management tools and services introduce additional security considerations. Each integration point represents a potential attack vector that must be secured through proper authentication, network controls, and monitoring.

This is particularly relevant when working with advanced integration platforms and MCP (Model Context Protocol) servers. In our work, Infracodebase ensures that all external integrations – whether with cloud management platforms, monitoring tools, or specialized infrastructure services – are secured with proper authentication, encrypted communications, and minimal permission grants. Infracodebase understands how to safely integrate with various cloud provider APIs, third-party security tools, and management platforms while maintaining the security integrity of your infrastructure.

Supply Chain Security

The tools and libraries we use to build and manage infrastructure can themselves be attack vectors. This includes ensuring that IaC tools are obtained from trusted sources, keeping them updated with security patches, and validating the integrity of downloaded components.

⚡ Operational Security: Day-to-Day Protection

Security doesn't end when infrastructure is deployed – it requires ongoing attention and maintenance.

Regular Security Assessments

Infrastructure should be regularly assessed for security vulnerabilities, configuration drift, and compliance with security policies. This includes both automated scanning and periodic manual reviews.

Incident Response Planning

When security incidents occur, having a well-defined response plan is crucial. This includes procedures for isolating affected resources, preserving evidence, communicating with stakeholders, and restoring normal operations.

Business Continuity

Security incidents can disrupt business operations, making disaster recovery and business continuity planning essential components of a comprehensive security strategy.

🚀 Future-Proofing Security

The security landscape is constantly evolving, and our infrastructure must be designed to adapt.

Emerging Threats

New attack vectors and techniques are constantly being developed. Our security architecture must be flexible enough to incorporate new protection mechanisms as they become available.

Regulatory Changes

Privacy and security regulations continue to evolve, and our infrastructure must be able to adapt to new compliance requirements without major redesigns.

Technology Evolution

As new cloud services and capabilities become available, our security models must evolve to take advantage of improved protection mechanisms while maintaining compatibility with existing systems.

🤖 Why Choose Infracodebase for AI-Powered Infrastructure Development

Working with traditional infrastructure development often means dealing with security as an afterthought, manual configuration errors, and inconsistent implementations across environments. Infracodebase's AI-powered approach transforms this process entirely.

🛠️ Comprehensive Tool Expertise: Infracodebase works fluently with the entire ecosystem of infrastructure tools – Terraform, OpenTofu, Pulumi, CloudFormation, AWS CDK, Kubernetes, Helm, Ansible, and more. Whether you need multi-cloud infrastructure, container orchestration, or configuration management, Infracodebase can generate production-ready code in the appropriate tool for your use case.

🧠 Built-in Security Intelligence: Every piece of infrastructure Infracodebase creates incorporates security best practices automatically. From network segmentation and IAM policies to encryption configurations and monitoring setup, security is embedded in the DNA of the code Infracodebase generates.

📊 Visual Architecture Design: Beyond just writing code, Infracodebase creates clear, professional architecture diagrams that visualize your infrastructure, security boundaries, and data flows. These diagrams make it easy for stakeholders to understand and audit your security posture.

🌐 Cross-Platform Consistency: Whether you're building on AWS, Azure, Google Cloud, or a multi-cloud setup, Infracodebase ensures consistent security patterns and practices across all platforms while leveraging the unique strengths of each provider.

🔌 Advanced Integration Capabilities: Infracodebase understands how to securely integrate with modern cloud management platforms, monitoring tools, and specialized services. This includes working safely with MCP servers and other advanced integration platforms while maintaining security integrity.

📚 Knowledge Transfer: Unlike traditional development approaches, Infracodebase doesn't just deliver code – it explains every decision, documents security considerations, and ensures your team understands the infrastructure they're deploying.

🎯 Conclusion

Building secure infrastructure using IaC requires a holistic approach that considers security at every level – from network design and identity management to data protection and operational procedures. While the specific implementations may vary across cloud providers, the fundamental principles of security remain constant: implement defense in depth, follow the principle of least privilege, maintain comprehensive visibility, and design for adaptability.

The key to success is treating security not as a checkbox to be ticked, but as a continuous process of assessment, improvement, and adaptation. By leveraging AI-powered infrastructure development, organizations can build infrastructure that not only meets today's security requirements but is also prepared for tomorrow's challenges.

In our experience helping organizations transform their infrastructure security posture, the combination of deep technical expertise, security-first design principles, and AI-powered development capabilities creates infrastructure that is both more secure and more maintainable than traditional approaches.

If you're looking to build secure, scalable cloud infrastructure that follows industry best practices while being tailored to your specific needs, Infracodebase would be happy to discuss how our AI-powered approach can help accelerate your infrastructure development while ensuring enterprise-grade security from day one.

What are your thoughts on AI-powered infrastructure development? Have you implemented any of these security practices in your IaC workflows? Share your experiences in the comments below!

Stay connected with me on:

linkedin.com/in/vjcloudops
vjcloudops.medium.com


Dreams Don’t Work Unless You Do: Lessons I Learned the Hard Way as a Developer

Siva Sankari — Thu, 25 Dec 2025 16:35:11 +0000

Introduction

A few years ago, I had a clean GitHub profile, dozens of bookmarked tutorials, and big dreams of becoming a “solid engineer.”

What I didn’t have?
Shipped projects. Production bugs. Real feedback.

I kept telling myself I was “preparing.”

The truth was uncomfortable:

Dreams don’t work unless you do — and in software engineering, doing means writing imperfect code, breaking things, and showing up consistently.

This article is about what finally clicked for me — and why this mindset matters more than any framework you’re learning right now.

The Backstory (Why This Matters)

Most developers I meet aren’t lazy.

They’re:

Over-preparing

Afraid of building the wrong thing

Waiting to feel ready

I was the same.

I believed:

“Once I finish this course, I’ll start building”

“Once I understand everything, I’ll apply for jobs”

“Once I’m confident, I’ll share my work”

That moment never came.

What changed my trajectory wasn’t motivation.
It was action without confidence.

The Core Idea

“Dreams don’t work unless you do” sounds like a motivational quote.

For developers, it’s actually a system design principle for your career.

In practice, it means:

Learning happens after implementation, not before

Clarity comes from feedback, not thinking

Confidence is a side effect of repetition

You don’t become a better developer by planning to code.
You become one by shipping → breaking → fixing → repeating.

Implementation: What “Doing the Work” Looked Like for Me

Building Before Feeling Ready

I stopped asking:

“Am I ready to build this?”

And started asking:

“What’s the smallest broken version I can ship?”

That meant:

Ugly UI

Hardcoded values

Missing edge cases

But it also meant momentum.

Treating Side Projects Like Production

I gave my side projects real rules:

Proper README

Clear problem statement

Deployed somewhere (even if imperfect)

That shift alone taught me more than months of tutorials.

Learning Through Bugs (Not Courses)

One production bug taught me more than five blog posts.

Here’s a retry function I once wrote without thinking deeply about failures:

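export async function retry<T>(
  fn: () => Promise<T>,
  retries = 3
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    // Out of retries: surface the last error to the caller.
    if (retries <= 0) throw error;
    // Retry immediately, with no delay and no error filtering.
    return retry(fn, retries - 1);
  }
}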

Looks simple.

In production, it raised real questions:

Should retries use exponential backoff?

Which errors should retry?

When does retrying actually make things worse?

👉 Doing the work exposed the gaps.
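
One possible answer to the first two questions is sketched below. It is not what I shipped; the option names, the default delay, and the `shouldRetry` hook are assumptions made for illustration.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical options; names and defaults are illustrative.
interface RetryOptions {
  retries?: number; // retries after the first attempt
  baseDelayMs?: number; // delay before the first retry
  shouldRetry?: (error: unknown) => boolean; // return false for errors that will never succeed
  sleep?: Sleep; // injectable so tests do not have to wait
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const { retries = 3, baseDelayMs = 100, shouldRetry = () => true, sleep = defaultSleep } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up when retries are exhausted or the error is not worth retrying.
      if (attempt >= retries || !shouldRetry(error)) throw error;
      // Exponential backoff: 100 ms, 200 ms, 400 ms with the defaults.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

The third question is a design decision rather than a code change: if the downstream service is already overloaded, retries add load, so a retry budget or added jitter may be worth considering on top of this.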

What Went Wrong (Lessons Learned)

I made plenty of mistakes:

Built projects nobody needed

Over-engineered early features

Ignored fundamentals while chasing trends

But every mistake had a hidden benefit:

👉 It created context.

Without context, advice doesn’t stick.

Best Practices I’d Share With Any Developer

Consistency beats intensity
30 minutes daily > 10 hours once a month

Build in public (even imperfectly)
Feedback accelerates growth

Finish small things
Completion builds confidence

Treat learning as iterative
Learn → Build → Break → Fix → Repeat

Common Pitfalls

Tutorial hoarding

Waiting for confidence before starting

Comparing your chapter 1 to someone else’s chapter 20

Optimizing tools instead of outcomes

If this feels personal — it was for me too.

Community Discussion

I’d love to hear from you:

What’s one project you planned but never shipped?

What finally helped you move from learning to doing?

What’s holding you back right now?

👇 Drop your thoughts in the comments — this is a shared journey.

FAQ
Is motivation overrated?

Yes. Systems and habits outperform motivation every time.

What if I don’t know what to build?

Build something boring. Real problems teach real skills.

Does this apply to senior developers too?

Absolutely. The tools change, but execution still matters.

Final Thoughts

Dreams are important.
They give direction.

But in software engineering, execution is the multiplier.

You don’t need more inspiration.
You need:

A pull request

A deployed app

A broken feature you fixed yourself

Because in the end —

Dreams don’t work unless you do.

Why Women Should Learn Digital Skills: A Developer’s Perspective

Siva Sankari — Sat, 20 Dec 2025 22:01:45 +0000

Introduction

Let me start with a simple scene many of us in tech have witnessed:

A new hire joins the team. She’s smart, curious, and qualified. But during stand-ups, she hesitates to speak. During demos, she lets others take credit. And during architecture discussions, she holds back — even when she’s right.

This isn’t a story about competence; it’s a story about confidence, access, and representation.

And it’s exactly why digital skills matter — not just to build software, but to build agency.

The Backstory — Why This Matters

For years, digital skills were framed as optional — nice to have, niche, or reserved for “tech people.”

That mindset is outdated.

Today:

banking is digital
healthcare is digital
education is digital
job search is digital
communication is digital

Choosing not to learn digital skills is no longer neutral — it’s a disadvantage.

And for women, who historically face more barriers to economic mobility…

digital skills become a leveling mechanism.

The Core Idea

Learning digital skills isn’t about turning everyone into developers.

It’s about:

1️⃣ Skill as Leverage

Digital literacy amplifies:

earning potential
employment flexibility
entrepreneurship

2️⃣ Independence & Flexibility

Remote work.

Freelancing.

Side income.

3️⃣ Breaking Gatekeeping

The more women understand technology,

the less gatekeeping can thrive.

These aren’t abstract ideals.

They’re practical outcomes.

A Real Story

When I was mentoring junior developers, one woman shared:

“I don’t know if I should be here. Everyone else seems more prepared.”

The turning point wasn’t when she mastered Git.

It wasn’t when she deployed her first backend service.

It was when she realized:

Digital skills aren't magic.
They're learnable.
They're repeatable.
They're accessible.

She later became the most dependable reviewer in the cohort — earning confidence through competence.

Where to Start — Practical Roadmap

If you’re advising someone — or starting yourself — here’s a realistic path:

🟦 Self-Paced Learning

YouTube
FreeCodeCamp
Coursera
MDN
W3Schools

🟩 Community-Led Learning

Women Who Code
Google Developer Groups
Meetups
Discord groups
Stack Overflow

🟨 Project-First Learning

Instead of learning theory first:

build a portfolio page to learn HTML
automate a boring task to learn Python

Progress becomes visible.

Momentum becomes natural.

Lessons Learned

Here are truths we often learn the hard way:

You will feel behind — everyone does at first
The industry is fast — embrace continuous learning
Imposter syndrome doesn’t vanish — you learn to work despite it
Digital literacy compounds like interest: each skill makes the next one easier to learn

Best Practices

To keep learning effective:

pick one skill at a time
focus on outcomes, not tools
join communities — not just courses
build projects early

Tech is not a solo sport.

Community accelerates competence.

Common Pitfalls

Avoid:

tutorial hell
comparison with seniors
perfectionism
believing you need genius-level math

Tech rewards persistence, curiosity, and experimentation — not perfection.

Community Discussion

I’d love to hear from:

women in tech
women considering tech
mentors
allies

What was the moment digital skills changed your opportunity or confidence?

FAQ

Is this only about coding?

No. Digital skills include data, automation, analytics, design, cybersecurity basics, and more.

Is it too late to start?

No. Tech rewards adaptability — not age.

Can beginners succeed without a CS degree?

Absolutely. Thousands have.

Final Thoughts

Digital skills are not just career tools.

They are:

confidence
autonomy
economic mobility
representation
freedom

If we want a tech industry that reflects society — not just a sliver of it — we must empower more women with not just opportunity, but ability.

Not someday.

Today.

Connect with me - https://www.linkedin.com/in/learnwithsankari/

I Built a Feature in 1 Hour, Not a Day

Asha mol — Fri, 19 Dec 2025 16:47:01 +0000

🧠 Introduction

A year ago, this feature would’ve stolen my entire workday.

You know the kind 👇

Requirements look simple

UI seems “straightforward”

Backend is “just CRUD”

And yet…

☕ Coffee goes cold
😵‍💫 Brain melts
💥 Git commits turn emotional

Last week, I built the same type of feature in one focused hour.

Same codebase.
Same language.
Same developer (me).

The difference wasn’t speed typing.
It was how I thought about the feature before touching the keyboard.

🧩 The Feature That Used to Drain My Day

Nothing fancy. Just classic enterprise app stuff:

Form-based UI

Validation

API integration

Save + edit flow

Conditional rendering

Earlier, my workflow looked like this:

Start coding UI

Realize backend needs tweaking

Modify API

Break another screen

Add custom validation

Duplicate logic “just this once”

Fix edge cases at the end (panic phase)

❌ That’s not development.
✅ That’s damage control.

🔄 What Changed This Time
♻️ Reusable Thinking > Custom Thinking

The biggest shift came from one question:

“Have I already solved 80% of this problem somewhere else?”

Turns out — I had.

Similar forms

Similar validations

Same API response shape

I wasn’t missing code.
I was missing reuse discipline.

⚙️ Automating the Boring Middle

I stopped hand-wiring things I could standardize:

Form state

Validation rules

API error mapping

Once these become predictable,
features stop being scary.
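
One way to make validation predictable is to express the rules as data and run them through a single function. This is a minimal sketch, assuming a hypothetical rule format (`required`, `pattern`, `message`); the names are illustrative and not taken from the original codebase.

```typescript
// Hypothetical rule format: each field declares its checks as data.
interface FieldRule {
  required?: boolean;
  pattern?: RegExp;
  message: string;
}

type RuleSet = Record<string, FieldRule>;

// One generic validator replaces per-form, hand-written checks.
function validate(
  rules: RuleSet,
  values: Record<string, string>
): Record<string, string> {
  const errors: Record<string, string> = {};
  for (const [field, rule] of Object.entries(rules)) {
    const value = (values[field] ?? "").trim();
    if (rule.required && value === "") {
      errors[field] = rule.message;
    } else if (value !== "" && rule.pattern && !rule.pattern.test(value)) {
      errors[field] = rule.message;
    }
  }
  return errors;
}

// Example rule set (assumed fields, for illustration only).
const userRules: RuleSet = {
  email: { required: true, pattern: /^[^@\s]+@[^@\s]+$/, message: "Enter a valid email" },
  role: { required: true, message: "Pick a role" },
};
```

A new form then needs a new rule set, not new validation code.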

⏳ The 1-Hour Build (Step by Step)
🧠 Step 1: Define Inputs & Outputs (10 minutes)

Before coding, I answered:

What data goes in?

What shape comes out?

What can fail?

I wrote this in plain English first.

No IDE.
No distractions.
Just clarity.
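
Once the plain-English answers exist, they translate almost directly into types. The sketch below is illustrative only: the fields, failure cases, and the `saveUser` stub are assumptions, not the original feature.

```typescript
// What data goes in? (hypothetical fields)
interface UserFormInput {
  email: string;
  role: "admin" | "user";
}

// What shape comes out, and what can fail? One union covers both.
type SaveResult =
  | { ok: true; id: string }
  | { ok: false; reason: "validation"; fieldErrors: Record<string, string> }
  | { ok: false; reason: "conflict" | "network"; message: string };

// Stub standing in for the real save call (assumption for the sketch).
function saveUser(input: UserFormInput): SaveResult {
  if (!input.email.includes("@")) {
    return { ok: false, reason: "validation", fieldErrors: { email: "Invalid email" } };
  }
  return { ok: true, id: "user-1" }; // placeholder id
}

// Callers are forced to handle every outcome explicitly.
function describeResult(result: SaveResult): string {
  if (result.ok) return `Saved user ${result.id}`;
  if (result.reason === "validation") {
    return `Fix these fields: ${Object.keys(result.fieldErrors).join(", ")}`;
  }
  return `Could not save (${result.reason}): ${result.message}`;
}
```

Writing the failure cases into the type up front is what keeps edge cases out of the "panic phase" at the end.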

♻️ Step 2: Reuse Before You Write (15 minutes)

I reused:

An existing form component

A shared validation schema

A common API wrapper

No pride.
No “I’ll clean it later”.

🧱 Step 3: Thin Backend, Smart Frontend (20 minutes)

Instead of creating custom endpoints, I used:

A generic POST handler

Config-driven behavior

🧠 Less backend code = fewer surprises.
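
To make "generic POST handler" concrete, here is a rough sketch of what such a handler could look like. It is not the handler from the original project: `ResourceConfig`, `handlePost`, and the in-memory store are all hypothetical stand-ins, and no web framework is involved.

```typescript
// Hypothetical resource config: which fields the generic handler accepts.
interface ResourceConfig {
  allowedFields: string[];
  requiredFields: string[];
}

type HandlerResult =
  | { status: 201; body: Record<string, unknown> }
  | { status: 400; body: { error: string } };

// In-memory store standing in for a real database (assumption).
const store = new Map<string, Record<string, unknown>[]>();

// One handler serves every resource; behavior comes from config, not new code.
function handlePost(
  resource: string,
  config: ResourceConfig,
  payload: Record<string, unknown>
): HandlerResult {
  const missing = config.requiredFields.filter((f) => payload[f] === undefined);
  if (missing.length > 0) {
    return { status: 400, body: { error: `Missing: ${missing.join(", ")}` } };
  }
  // Drop anything the config does not explicitly allow.
  const record: Record<string, unknown> = {};
  for (const field of config.allowedFields) {
    if (field in payload) record[field] = payload[field];
  }
  store.set(resource, [...(store.get(resource) ?? []), record]);
  return { status: 201, body: record };
}
```

Adding a new resource then means adding a config entry rather than a new endpoint.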

🧪 Code Example (Simplified)

Here’s the pattern that saved me time — config-driven forms.

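// formConfig.ts
export interface FormConfig {
  fields: { name: string; type: string; required?: boolean; options?: string[] }[];
  endpoint: string;
}

export const userFormConfig: FormConfig = {
  fields: [
    { name: "email", type: "email", required: true },
    { name: "role", type: "select", options: ["admin", "user"] }
  ],
  endpoint: "/api/users"
};

// ReusableForm.tsx
import type { FormEvent } from "react";
import { api } from "./api";       // shared API wrapper (import path assumed)
import { Input } from "./Input";   // shared field component (import path assumed)
import type { FormConfig } from "./formConfig";

export function ReusableForm({ config }: { config: FormConfig }) {
  const { fields, endpoint } = config;

  // onSubmit receives the submit event, not the form values,
  // so collect the values from the form element before posting.
  const handleSubmit = (event: FormEvent<HTMLFormElement>) => {
    event.preventDefault();
    const data = Object.fromEntries(new FormData(event.currentTarget));
    api.post(endpoint, data);
  };

  return (
    <form onSubmit={handleSubmit}>
      {fields.map(field => (
        <Input key={field.name} {...field} />
      ))}
    </form>
  );
}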

✨ This isn’t fancy.
🔁 It’s repeatable — and repeatability is speed.

🧠 Best Practices I Learned the Hard Way

Design reusable patterns, not one-off features

Write code assuming you’ll reuse it next week

If it feels repetitive → it deserves abstraction

Time spent thinking upfront saves hours later

“Simple” features expose bad architecture fast

⚠️ Common Pitfalls (I’ve Fallen Into All of These)

Over-customizing too early

Ignoring existing utilities

Mixing business logic into UI

Coding for today, not the next 5 features

Refactoring after shipping instead of before starting

💬 Community Corner

I’m curious 👇

What feature surprised you by being much faster than expected?

What abstraction saved you the most time?

Do you prefer config-driven reuse or explicit code?

Drop your stories, patterns, or counter-arguments in the comments.
Different teams solve this differently — and that’s the fun part.

❓ FAQ

Was this because of AI tools?

No. This was about architecture and reuse, not autocomplete.

Is this approach good for startups?

Especially for startups. Speed + consistency matters most there.

Doesn’t abstraction slow you down initially?

Yes. Once. Then it pays you back repeatedly.

What if requirements change?

Config-driven designs adapt faster than hardcoded flows.

Is this more frontend or backend focused?

Both, but frontend benefits immediately.

Can juniors apply this?

Absolutely. Start small: reuse one component at a time.

What’s the biggest takeaway?

👉 Think in systems, not tasks.

🎯 Conclusion

That 1-hour feature wasn’t luck.

It was the result of:

Fewer decisions

Better reuse

Respecting my future self’s time

If every feature feels heavier than it should,
don’t work faster — work differently.

If this resonated, give it a ❤️, share it with your team,
or follow me for more real-world dev lessons —
no fluff, just scars and solutions.

Understanding Agentic AI: How Modern Systems Make Autonomous Decisions

Shruthi Chikkela — Sun, 14 Dec 2025 21:53:04 +0000

What Is Agentic AI? A Practical, Real‑World Introduction for Developers

If you are a developer, DevOps engineer, or cloud professional, chances are you’ve already built systems that behave a little like agents — you just didn’t call them that.

Agentic AI is not science fiction, not sentient machines, and not a replacement for engineering discipline. It is simply software that can decide what to do next in order to achieve a goal.

In this post, we’ll break down Agentic AI from first principles — clearly, realistically, and without hype — using examples that make sense for real production systems.

Why Agentic AI Is Suddenly Everywhere

Agentic AI didn’t appear overnight.

It’s the result of how software systems have evolved over the last decade, especially in cloud, DevOps, and large-scale distributed environments.

To understand why agentic AI is everywhere today, we need to look at how we’ve historically handled operations and decision-making in software systems.

Phase 1: Manual Operations — Humans Run Commands

Not too long ago, most systems were operated manually.

A typical workflow looked like this:

A system misbehaves
An alert fires
An engineer logs into a server
Commands are run by hand
Fixes are applied based on experience

This model relied heavily on:

human judgment
tribal knowledge
runbooks and documentation

It worked — but it did not scale.

As systems grew larger:

more services
more environments
more dependencies

Humans became the bottleneck.

Every decision depended on:

who was on call
how experienced they were
how quickly they could reason under pressure

This was the first pain point.

Phase 2: Automation — Scripts and Pipelines

To reduce manual work, we introduced automation.

Examples you already know well:

Bash / PowerShell scripts
CI/CD pipelines
Terraform and ARM templates
Ansible, Chef, Puppet
Scheduled jobs and cron tasks

Automation was a massive improvement.

Instead of:

“Log in and fix it”

We moved to:

“If X happens, do Y”

This brought:

speed
consistency
repeatability

But automation has a hard limitation:

It only works for scenarios you explicitly planned for.

Automation assumes the world behaves predictably.

The Cracks in Traditional Automation

As systems became cloud-native and distributed, automation started failing in subtle but painful ways.

Consider real-world scenarios:

A restart fixes the issue sometimes
Scaling helps only during peak hours
A fix works in one region but breaks another
A dependency fails intermittently
Metrics contradict each other

Automation doesn’t reason.
It doesn’t ask:

“Did that action help?”
“Should I try something else?”
“Is this situation similar to past incidents?”

When automation hits an unexpected state, it stops — and hands control back to humans.

This is where modern systems started to outgrow static rules.

Phase 3: Intelligent Automation — Systems That Decide What to Do

This is where agentic AI enters.

Instead of encoding every possible decision upfront, we started asking a different question:

“Can the system decide what to do next based on the current situation?”

This is intelligent automation.

The system:

observes what’s happening
reasons about possible actions
chooses one
evaluates the result
adjusts if needed

This decision-making loop is exactly what humans do during incidents — just much faster and more consistently.
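
The five steps above can be sketched as a small generic loop. This is a skeleton under stated assumptions: `LoopHooks`, `runLoop`, and the callback names are illustrative, and a real agent would plug in actual metrics, tools, and guardrails.

```typescript
// Generic skeleton of the loop; all types and callbacks are illustrative.
interface LoopHooks<State, Action> {
  observe: () => State;
  decide: (state: State, tried: Action[]) => Action | null; // null = nothing left to try
  act: (action: Action) => void;
  isHealthy: (state: State) => boolean;
}

type LoopOutcome<Action> = { resolved: boolean; tried: Action[] };

function runLoop<State, Action>(
  hooks: LoopHooks<State, Action>,
  maxSteps = 5
): LoopOutcome<Action> {
  const tried: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const state = hooks.observe(); // observe what's happening
    if (hooks.isHealthy(state)) return { resolved: true, tried }; // evaluate the result
    const action = hooks.decide(state, tried); // reason and choose
    if (action === null) break; // out of options: hand back to a human
    hooks.act(action);
    tried.push(action); // adjust: the next pass sees what was already tried
  }
  return { resolved: hooks.isHealthy(hooks.observe()), tried };
}
```

The `maxSteps` cap is a deliberate design choice: an agent that can retry needs a hard bound on how long it keeps trying.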

Agentic AI sits squarely in this third phase.

Why This Shift Is Happening Now

Agentic AI is not popular because of hype alone.
It exists because modern systems forced us into it.

Let’s look at the realities of today’s production environments.

1. Systems Are Distributed

Modern applications are no longer:

a single server
a single database
a single failure point

They are:

microservices
message queues
managed cloud services
third-party APIs
multi-region deployments

Failures are rarely isolated.

A single alert might be a symptom, not the cause.

Static automation struggles because:

it sees one signal
it acts in isolation
it lacks system-wide context

Agentic systems can reason across multiple signals and dependencies.

2. Systems Are Noisy

Modern observability generates:

thousands of metrics
millions of logs
endless alerts

Not every alert matters.
Not every spike is a problem.

Humans are good at pattern recognition.
Scripts are not.

Agentic AI helps by:

correlating signals
filtering noise
prioritizing what actually matters

This is why agentic approaches are exploding in:

alert triage
incident management
security monitoring
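
As a rough illustration of correlating, filtering, and prioritizing, the sketch below collapses raw alerts into one incident per service. The alert shape and the `triage` function are assumptions for the example; real triage would use far richer signals.

```typescript
// Illustrative alert shape; field names are assumptions.
interface RawAlert {
  service: string;
  signal: string; // e.g. "latency", "cpu", "errors"
  severity: 1 | 2 | 3; // 1 = most severe
}

interface Incident {
  service: string;
  signals: string[];
  severity: number;
  alertCount: number;
}

// Collapse many alerts into one incident per service, most urgent first.
function triage(alerts: RawAlert[]): Incident[] {
  const byService = new Map<string, Incident>();
  for (const a of alerts) {
    const incident: Incident = byService.get(a.service) ?? {
      service: a.service,
      signals: [],
      severity: a.severity,
      alertCount: 0,
    };
    if (!incident.signals.includes(a.signal)) incident.signals.push(a.signal); // filter duplicates
    incident.severity = Math.min(incident.severity, a.severity); // keep the worst severity
    incident.alertCount += 1;
    byService.set(a.service, incident);
  }
  // Prioritize: worst severity first, then services showing more distinct signals.
  return Array.from(byService.values()).sort(
    (x, y) => x.severity - y.severity || y.signals.length - x.signals.length
  );
}
```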

3. Systems Are Constantly Changing

In cloud environments:

infrastructure scales automatically
deployments happen daily
configurations drift
dependencies evolve

Static rules age quickly.

A rule written six months ago may no longer be valid today.

Agentic AI adapts because it:

evaluates outcomes
adjusts decisions
works with current state, not assumptions

This makes it suitable for living systems, not static ones.

Why Static Rules Are No Longer Enough

Static rules assume:

predictable behavior
limited variability
known failure modes

Modern systems violate all three.

Agentic AI does not replace rules —
it operates above them, deciding which rule or action to apply and when.

Think of it this way:

Automation executes
Agents decide

A DevOps Perspective (Very Important)

Agentic AI is not trying to replace:

engineers
automation tools
infrastructure-as-code

It is trying to replace:

repetitive decision-making
cognitive overload
slow human reaction loops

From a DevOps point of view, agentic AI is:

An on-call assistant that never sleeps, reasons consistently, and knows when to escalate.

A Simple Definition You Can Remember

One of the biggest problems with Agentic AI is not the technology —
it’s the lack of a clear, usable definition.

Most definitions you see online are either:

too academic to be practical, or
too vague to be meaningful

As engineers, we need definitions that help us design systems, not just talk about them.

So let’s define Agentic AI in a way that actually works in real projects.

A Practical Definition (Not Marketing)

Agentic AI is software that can pursue a goal by observing its environment, deciding what to do next, taking actions through tools, and evaluating the outcome.

This definition is important because every word has engineering meaning.

Let’s break it down slowly.

“Software That Can Pursue a Goal”

This is the most important part.

Traditional software executes instructions.
Agentic software pursues outcomes.

Compare the two:

Instruction-based:

“Restart the service”

Goal-based:

“Restore system reliability without causing user impact”

The second statement allows multiple valid paths:

restart
scale
fail over
roll back
do nothing and observe

Agentic AI exists to choose between these paths.

“Observing Its Environment”

Agents do not operate blindly.

They continuously observe:

system metrics
logs
traces
API responses
external signals

This is no different from what a DevOps engineer does during an incident:

check dashboards
read logs
correlate symptoms

The difference is speed and consistency, not intelligence.

If a system cannot observe state, it is not an agent — it’s just a script.

“Deciding What to Do Next”

This is where agentic systems differ fundamentally from automation.

Automation follows a predefined path:

If A → do B

Agents ask:

Given what I see right now, what action makes the most sense?

This decision can involve:

comparing options
weighing risks
checking constraints
learning from past outcomes

This is runtime decision-making, not compile-time logic.
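
A simple way to picture this is a function that filters options by constraints and then ranks what is left. The sketch assumes hand-assigned estimates and a naive "benefit minus risk" score; both are placeholders, not a recommended scoring model.

```typescript
// Candidate actions with rough, hand-assigned estimates (illustrative numbers).
interface Candidate {
  name: string;
  expectedBenefit: number; // 0..1, how likely it is to help
  risk: number; // 0..1, chance of user impact
  cost: number; // relative cost units
}

interface Constraints {
  maxRisk: number;
  maxCost: number;
}

// Pick the best option that respects the constraints, or null to escalate.
function chooseAction(options: Candidate[], limits: Constraints): Candidate | null {
  const allowed = options.filter(
    (o) => o.risk <= limits.maxRisk && o.cost <= limits.maxCost
  );
  if (allowed.length === 0) return null;
  // Naive trade-off: benefit minus risk. A real system would tune this.
  return allowed.reduce((best, o) =>
    o.expectedBenefit - o.risk > best.expectedBenefit - best.risk ? o : best
  );
}
```

Returning `null` when nothing fits the constraints matters as much as the ranking: it is the point where the decision goes back to a human.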

“Taking Actions Through Tools”

Agents do not act directly on the world.

They use tools — just like humans.

In real systems, tools are:

Azure CLI
Kubernetes API
GitHub Actions
Terraform
REST APIs
Internal services

This point matters a lot.

If an “AI system” cannot actually do anything, it is not agentic — it’s advisory at best.
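
In code, "acting through tools" often means the agent can only call functions that were explicitly registered for it. The sketch below assumes a hypothetical `ToolRegistry` and a fake `scale_deployment` tool; a real tool would call the Kubernetes API or Azure CLI instead of returning a string.

```typescript
// A tool is a named capability the agent is allowed to invoke.
interface AgentTool {
  description: string;
  run: (args: Record<string, string>) => string;
}

class ToolRegistry {
  private tools = new Map<string, AgentTool>();
  readonly auditLog: string[] = [];

  register(name: string, tool: AgentTool): void {
    this.tools.set(name, tool);
  }

  // The agent can only act through registered tools; anything else is refused.
  invoke(name: string, args: Record<string, string>): string {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Tool not allowed: ${name}`);
    this.auditLog.push(`${name} ${JSON.stringify(args)}`);
    return tool.run(args);
  }
}

// Hypothetical tool: a real one would call the cluster API here.
const registry = new ToolRegistry();
registry.register("scale_deployment", {
  description: "Set the replica count of a deployment",
  run: (args) => `scaled ${args["deployment"]} to ${args["replicas"]} replicas`,
});
```

The registry doubles as a guardrail: the allow-list bounds what the agent can do, and the audit log records what it did.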

“Evaluating the Outcome”

This is the part most people miss.

After acting, an agent asks:

Did this help?
Did the metric improve?
Did the error rate drop?
Did latency stabilize?

Without evaluation, there is no learning.
Without learning, there is no agency.

This feedback loop is what allows:

retries
alternative strategies
escalation to humans

The Core Agent Loop (Again, Because It Matters)

Every real agent follows this loop:

Observe → Decide → Act → Evaluate

If you remember this loop, you can:

identify agentic systems
design your own
avoid fake “agent” hype

What Agentic AI Is NOT (Very Important)

To avoid confusion, let’s be explicit.

Agentic AI is not:

❌ A chatbot answering questions
❌ A single ML model
❌ A prompt with multiple steps
❌ A replacement for engineers
❌ A system without guardrails

Many products today are labeled “agents” but only satisfy one or two parts of the loop.

That does not make them agentic systems.

A Layman Example (Non-Technical)

Imagine a personal assistant.

A basic assistant:

waits for instructions
executes exactly what you say

An agentic assistant:

understands your goal (“get me to the airport on time”)
checks traffic
monitors flight updates
suggests leaving early
reroutes if needed

Same tools.
Same environment.
Different level of autonomy.

That difference is agency.

A Real DevOps Example

Let’s ground this in reality.

Goal: Keep a web application available.

An agentic system might:

detect increased latency
analyze recent deployments
check resource utilization
decide whether to scale or roll back
apply the action
verify user experience metrics

At no point did a human say:

“Do step 1, then step 2, then step 3”

The human defined the goal and constraints.
The agent handled the decisions.

Why This Definition Matters

This definition helps you answer practical questions like:

Should I use an agent here?
Is my system truly agentic?
Where do I limit autonomy?
Where do humans stay involved?

Without a clear definition, teams either:

overbuild agents where they aren’t needed, or
fear them where they would help the most

Key Takeaway (Memorable)

If you remember one thing from this section:

Agentic AI is about decision-making autonomy, not intelligence.

It’s not smarter software.
It’s more responsible software — when designed correctly.

A DevOps Analogy: You’ve Already Built “Agents” (Without Calling Them That)

One of the reasons Agentic AI feels confusing is because it’s often presented as something completely new.

In reality, DevOps engineers have been moving toward agent-like systems for years.

Let’s walk through a familiar scenario — no AI required.

The Traditional On-Call Workflow

Imagine a production incident at 2 a.m.

A service becomes slow or unavailable.

What happens next?

Monitoring system fires an alert
On-call engineer receives notification
Engineer opens dashboards
Logs are inspected
Metrics are correlated
A hypothesis is formed
An action is taken
Results are observed
More actions are taken if needed

This process is not random.

It is a decision loop driven by:

goals (restore service)
observations (metrics, logs)
actions (restart, scale, rollback)
feedback (did it work?)

Humans are acting as agents here.

What Automation Changed (and Didn’t)

Automation helped us reduce manual effort.

Instead of typing commands, we wrote:

scripts
pipelines
runbooks
auto-scaling rules

This improved speed and consistency.

But notice something important:

Automation usually handles execution, not decision-making.

A script does exactly what it’s told.
A pipeline follows a fixed path.
An auto-scaler reacts to one metric.

When conditions change unexpectedly, automation stops — and humans step back in.

Where Humans Still Do the Hard Work

Even in highly automated environments, humans still handle:

interpreting noisy alerts
deciding which signal matters
choosing between multiple fixes
stopping automation when it causes harm

This is the hard part of operations.

And this is exactly where agentic AI is applied.

Agentic AI as a “Junior On-Call Engineer”

A good way to think about agentic AI is this:

Agentic AI is like a junior on-call engineer who follows runbooks, observes systems, tries safe actions, and escalates when unsure.

Not a senior architect.
Not an all-knowing system.

A careful, limited, supervised decision-maker.

This framing is important because it sets realistic expectations.

How an Agent Fits Into the Same Workflow

Let’s revisit the same incident — now with an agent involved.

Alert fires
Agent collects metrics and logs
Agent matches patterns from past incidents
Agent selects a low-risk action
Agent executes via approved tools
Agent observes outcome
Agent either:

stops (success), or
tries an alternative, or
escalates to a human
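
The last three steps (stop, try an alternative, or escalate) can be sketched as a small runbook runner. Everything here is hypothetical: `RunbookStep`, `handleIncident`, and the risk numbers are placeholders chosen to show the shape of the workflow.

```typescript
// Illustrative runbook entry: an approved action plus a check that it worked.
interface RunbookStep {
  name: string;
  risk: number; // lower = safer; tried first
  execute: () => void;
  verify: () => boolean; // true when the service looks healthy again
}

type IncidentResult =
  | { status: "resolved"; fixedBy: string; attempted: string[] }
  | { status: "escalated"; attempted: string[] };

function handleIncident(steps: RunbookStep[], maxAttempts = 3): IncidentResult {
  const attempted: string[] = [];
  const ordered = steps.slice().sort((a, b) => a.risk - b.risk); // safest first
  for (const step of ordered.slice(0, maxAttempts)) {
    step.execute();
    attempted.push(step.name);
    if (step.verify()) return { status: "resolved", fixedBy: step.name, attempted };
    // Otherwise fall through and try the next alternative.
  }
  // Nothing worked within the attempt budget: hand over to a human with context.
  return { status: "escalated", attempted };
}
```

The `attempted` list travels with the escalation, so the on-call engineer starts from what was already tried.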

Nothing magical happened.

The difference is who is making the routine decisions.

Why This Matters at Scale

This analogy becomes critical at scale.

When you have:

hundreds of services
multiple regions
frequent deployments
24/7 operations

Human decision-making does not scale linearly.

Agentic systems help by:

handling common patterns
reducing alert fatigue
speeding up recovery
keeping humans focused on complex cases

This is not about replacing engineers.
It’s about using engineers where they add the most value.

The Key Insight From the DevOps Analogy

Agentic AI is not a new class of software.

It is a shift in responsibility:

Automation executes actions
Agents decide which actions to execute
Humans define goals, constraints, and oversight

Once you see this, agentic AI stops being mysterious.

A Subtle but Important Point

If you remove AI entirely and implement:

dynamic decision trees
feedback loops
state evaluation
escalation logic

You are already building an agentic system.

LLMs simply make:

reasoning more flexible
logic less brittle
adaptation easier

But the architecture comes first.

Key Takeaway

If you remember one thing from this section:

Agentic AI automates decision-making, not responsibility.

Responsibility stays with engineers.
Agents just reduce the manual thinking load.

The Core Agent Loop: Observe → Decide → Act → Evaluate

At the heart of every agentic system is a simple, repeatable loop:

Observe → Decide → Act → Evaluate

This loop may look simple on paper, but understanding it deeply is key for designing practical, reliable agentic systems.

Step 1: Observe — Understanding the Environment

Observation is the first step. The agent must know what is happening before it acts.

In DevOps and cloud systems, observations typically include:

Metrics (CPU, memory, latency)
Logs (error messages, events)
Traces (request flows, service calls)
API responses from services
External signals (alerts, third-party integrations)

Example:

A Kubernetes cluster experiences higher latency.
The agent observes:

Pod CPU usage is high
Memory usage is within limits
Deployment history shows a new rollout

Observation gives context for the next decision.

Without accurate observation, the agent cannot reason — it’s blind.

Step 2: Decide — Choosing the Best Action

Next comes decision-making. The agent decides what to do next based on:

The goal (e.g., “restore service availability”)
Observed state
Constraints (risk thresholds, cost limits)
Past experience (previous actions and outcomes)

Example Decision Options:

Restart a pod
Scale the deployment
Roll back recent changes
Notify human operators

The agent evaluates trade-offs:

Will scaling help latency without overspending resources?
Will rollback disrupt ongoing user requests?

This is reasoning, not random action.
It mirrors what an engineer does — just automated.

Step 3: Act — Executing Through Tools

Once the decision is made, the agent executes the chosen action using tools:

Azure CLI commands to scale resources
Kubernetes API to restart pods
Terraform to modify infrastructure
Internal scripts for database maintenance
Webhooks or APIs for notifications

Key point: The agent does not act magically.
It interacts with the real system through the same mechanisms humans would use — just faster and more reliably.

Step 4: Evaluate — Feedback and Learning

After acting, the agent must check the result:

Did the latency improve?
Did errors decrease?
Was the change safe for users?
Should the action be reversed?

Example:

If scaling did not reduce latency:

The agent may try restarting pods instead
Or escalate to a human operator

Evaluation ensures:

The system learns from outcomes
Actions are validated
Failures are caught before they propagate

Without evaluation, you have automation, not agency.
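
A minimal sketch of the evaluate step, assuming the agent keeps a before and after snapshot of two metrics. The field names, the 10% tolerance, and the mapping from verdict to next move are illustrative choices, not prescribed values.

```typescript
// Snapshot of the metrics the agent cares about (illustrative fields).
interface HealthSnapshot {
  p95LatencyMs: number;
  errorRate: number; // fraction of failed requests, 0..1
}

type Verdict = "improved" | "no_change" | "worse";

// Compare before/after with a tolerance so normal jitter is not read as a signal.
function evaluateAction(
  before: HealthSnapshot,
  after: HealthSnapshot,
  tolerance = 0.1
): Verdict {
  // Relative change; the floor avoids dividing by zero when the baseline is 0.
  const change = (prev: number, next: number) => (next - prev) / Math.max(prev, 1e-9);
  const latency = change(before.p95LatencyMs, after.p95LatencyMs);
  const errors = change(before.errorRate, after.errorRate);
  if (latency > tolerance || errors > tolerance) return "worse";
  if (latency < -tolerance || errors < -tolerance) return "improved";
  return "no_change";
}

// Map the verdict to the next move described in the text.
function nextMove(verdict: Verdict): "stop" | "try_alternative" | "revert_and_escalate" {
  if (verdict === "improved") return "stop";
  if (verdict === "no_change") return "try_alternative";
  return "revert_and_escalate";
}
```

Checking for "worse" before "improved" is intentional: an action that lowers latency while raising the error rate should not count as a success.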

Why This Loop Is So Powerful

It creates autonomy: The agent can handle many small decisions without human intervention.
It enables adaptation: The agent responds dynamically to changing environments.
It allows learning: Feedback ensures the system improves over time.
It scales operations: Hundreds of microservices or cloud regions can be monitored and managed simultaneously.

In short, this loop is the secret sauce that separates static automation from intelligent agents.

DevOps Analogy: Incident Response at Scale

Imagine a production incident across multiple regions:

Observe: Agent collects metrics from all regions, logs, and alerts.
Decide: Determines that Region A needs scaling and Region B needs a pod restart.
Act: Executes actions through Azure/Kubernetes APIs.
Evaluate: Checks metrics to verify response; escalates only if unresolved.

Humans no longer make routine decisions — they focus on complex, strategic choices.

Key Takeaways

Every agent follows Observe → Decide → Act → Evaluate.
Observation and evaluation are as important as action.
Autonomy does not mean “no human oversight.” It means smart delegation of repetitive decisions.
Understanding this loop is critical before building or evaluating any agentic system.

Breaking Down the Core Components of an Agentic System

Now that we understand the agent loop — Observe → Decide → Act → Evaluate —
it’s time to look at what actually makes an agent work.

Every agentic system, whether in DevOps, cloud automation, or research workflows, has five core components:

Goal
Observation
Reasoning / Decision-making
Tools / Actions
Memory / Feedback

We’ll break each down in detail with real-world examples.

1. Goal: The North Star of the Agent

Every agent needs a goal. Without it, it is directionless.

Definition: The goal defines what the agent is trying to achieve.

Why it matters:

It ensures that every decision aligns with desired outcomes.
It allows flexibility in choosing how to achieve the goal.

Example in DevOps:

Goal: “Restore system availability within 5 minutes”
The agent can:
Restart failing services
Scale resources dynamically
Roll back recent deployments

Notice: The goal doesn’t prescribe steps, only the desired state.
This is key to autonomy.
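
One way to encode "desired state, not steps" is to model the goal as a predicate over observed state. The sketch below is an assumption-laden illustration: `ServiceState`, `AgentGoal`, and the 5-minute check mirror the example above but are not from any particular framework.

```typescript
// Observed state of the service (illustrative fields).
interface ServiceState {
  available: boolean;
  minutesSinceOutage: number;
}

// A goal names the desired state and a deadline, never the steps to get there.
interface AgentGoal {
  description: string;
  isMet: (state: ServiceState) => boolean;
  isBreached: (state: ServiceState) => boolean;
}

const restoreAvailability: AgentGoal = {
  description: "Restore system availability within 5 minutes",
  isMet: (s) => s.available,
  isBreached: (s) => !s.available && s.minutesSinceOutage > 5,
};

// Any action (restart, scale, roll back) is acceptable as long as isMet becomes true.
function goalStatus(
  goal: AgentGoal,
  state: ServiceState
): "met" | "in_progress" | "breached" {
  if (goal.isMet(state)) return "met";
  return goal.isBreached(state) ? "breached" : "in_progress";
}
```

Nothing in the goal mentions restarting or scaling, which is what leaves the agent free to choose.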

2. Observation: Understanding the Environment

Observation is the data intake stage of the agent.

What it observes:

Metrics: CPU, memory, latency, error rates
Logs: system, application, security
Traces: request flows, dependency graphs
External inputs: alerts, API responses, monitoring tools

Example:
An agent monitoring a Kubernetes cluster notices:

Pod CPU is at 95%
Memory usage is 60%
Recent deployments included a new container image

Observation provides context for reasoning.

3. Reasoning / Decision-Making: Choosing the Next Action

Reasoning is the agent’s thinking step.

It decides:

Which action best achieves the goal
Which trade-offs are acceptable
Whether to escalate or retry

Example Decisions:

Scale up pods by 2 vs. restart failing pods
Delay action due to ongoing deployments
Escalate to human on-call if uncertainty is high

Reasoning is structured, not human-like intelligence.
It’s comparable to following a dynamic runbook.

4. Tools / Actions: How the Agent Executes

Agents don’t magically fix systems — they use tools to act.

Common DevOps / Cloud tools agents interact with:

Azure CLI or PowerShell for cloud resources
Kubernetes API for container orchestration
Terraform / ARM templates for infrastructure changes
GitHub Actions or CI/CD pipelines for deployment tasks

Example:

An agent detects high latency → scales pods using Kubernetes API → verifies metrics → escalates if unresolved

The key point: the agent interacts with real systems just like humans do, but faster and more consistently.

5. Memory / Feedback: Learning from Outcomes

Memory allows the agent to avoid repeating mistakes and improve decisions.

Types of memory:

Short-term: current task context (e.g., already tried restarting pod)
Long-term: historical patterns (e.g., a previous deployment caused similar latency spikes)

Feedback:
After acting, the agent evaluates the results:

Did CPU usage drop?
Did latency improve?
Was the service restored?

This feedback loop ensures continuous improvement, even without retraining models from scratch.
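
The two kinds of memory can be sketched as a small class: a per-incident set of attempted actions and a running history of outcomes. The `AgentMemory` class and its method names are hypothetical, and a real system would persist the long-term part somewhere durable.

```typescript
// Record of what was tried and whether it worked (illustrative shape).
interface OutcomeRecord {
  symptom: string;
  action: string;
  succeeded: boolean;
}

class AgentMemory {
  private shortTerm = new Set<string>(); // actions already tried in this incident
  private longTerm: OutcomeRecord[] = []; // history across incidents

  record(entry: OutcomeRecord): void {
    this.shortTerm.add(entry.action);
    this.longTerm.push(entry);
  }

  alreadyTried(action: string): boolean {
    return this.shortTerm.has(action);
  }

  // Historical success rate of an action for a symptom; null if never seen.
  successRate(symptom: string, action: string): number | null {
    const past = this.longTerm.filter(
      (r) => r.symptom === symptom && r.action === action
    );
    if (past.length === 0) return null;
    return past.filter((r) => r.succeeded).length / past.length;
  }

  endIncident(): void {
    this.shortTerm.clear(); // long-term history is kept
  }
}
```

No model is retrained here: the "learning" is simply that later decisions can consult `successRate` before choosing an action.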

Putting It All Together: A Real-World Example

Imagine an agent managing an e-commerce platform:

Goal: Keep checkout service uptime > 99.9%
Observation: Collects metrics, logs, recent deployment info
Decision: Detects spike in latency; decides to scale pods and restart failing containers
Action: Executes Kubernetes API commands, applies scaling rules
Memory / Feedback: Notes which pods were restarted, verifies latency drop, escalates if unresolved

Notice how each component directly maps to the agent loop we discussed earlier.

Key Takeaways

Agentic systems are structured and predictable, not magical.
Goals, observation, reasoning, tools, and memory are the building blocks.
Real-world examples show how these components fit naturally in DevOps/cloud workflows.
Understanding these components is crucial before trying to build an agentic AI system.

Agentic AI vs Traditional Automation

At this point, you understand what an agent is and its core components.
Now it’s important to see how it differs from traditional automation, because many teams confuse the two.

Traditional Automation: Execution Only

Automation has been around for decades. Examples you already know:

Scripts for deployments (Bash, PowerShell, Python)
CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps pipelines)
Infrastructure-as-Code (Terraform, ARM templates)
Scheduled jobs and cron tasks

Key characteristics:

Predictable: Automation follows a fixed path.
Rule-based: It executes pre-defined instructions.
Non-adaptive: If the scenario falls outside what was scripted, the automation fails or applies the wrong fix.
No feedback reasoning: It does not decide next steps based on outcome.

Example:
A script restarts a service when CPU exceeds 90%.

Works if the problem matches the expected scenario.
Fails if the real issue is a stuck process in a dependent service.
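
Written as code, the rule above is a single threshold check. In this sketch the `dependencyHealthy` field is an assumption added only to show what the rule ignores.

```typescript
// A static rule: one metric, one threshold, one fixed response.
interface HostMetrics {
  cpuPercent: number;
  dependencyHealthy: boolean; // the rule below never looks at this
}

function staticRule(metrics: HostMetrics): "restart_service" | "do_nothing" {
  return metrics.cpuPercent > 90 ? "restart_service" : "do_nothing";
}

// The real cause is a stuck dependency, but the rule restarts the service anyway.
const misdiagnosed = staticRule({ cpuPercent: 95, dependencyHealthy: false });

// The dependency is still broken, yet CPU looks fine, so the rule does nothing.
const missed = staticRule({ cpuPercent: 40, dependencyHealthy: false });
```

In both cases the rule behaves exactly as written, which is the limitation: it cannot notice that its one signal is not the whole story.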

Traditional automation is powerful, but limited by what we explicitly encode.

Agentic AI: Decisions on Autopilot

Agentic AI sits above automation:

Observes the system (metrics, logs, alerts)
Chooses the best action based on goals and context
Executes actions using the same tools as automation
Evaluates the outcome and adapts

Example in DevOps:
Goal: “Restore web service uptime.”

Agent observes latency and errors across regions
Determines which region has failing pods
Decides to scale or restart pods based on historical success
Executes action via Kubernetes API
Verifies system health; escalates if necessary

Here, automation is one layer inside the agent: the agent may call the same scripts or APIs, but it decides which one to call and when.

Comparing the Two: Key Differences

Feature	Traditional Automation	Agentic AI
Decision-making	None (fixed instructions)	Autonomous (evaluates options)
Adaptability	Low	High
Feedback loop	Manual or scripted	Built-in evaluation & learning
Use cases	Repetitive, predictable tasks	Complex, multi-step, dynamic tasks
Human reliance	Always needed for unexpected cases	Reduced for routine decisions

Why It Matters in Real Projects

In small, predictable systems, traditional automation is sufficient.
But in modern cloud-native environments:

Microservices interact in complex ways
Traffic patterns fluctuate constantly
Deployments happen multiple times per day
Multiple regions and dependencies exist

Automation alone cannot adapt. Static rules break under real-world complexity.

Agentic AI allows teams to:

Reduce incident response time
Scale operations without linearly increasing human effort
Apply reasoning to dynamic, multi-step processes
Keep humans focused on higher-value decisions

A DevOps Analogy: Automation vs Agentic AI

Scenario: Service latency spikes.

Automation: Predefined script runs → restarts pod → done
Agentic AI: Observes latency, checks logs, evaluates recent deployments, chooses safest action (restart, scale, rollback), executes, verifies, escalates if needed

The difference: automation executes; the agent decides.

Key Takeaways

Automation is execution; agentic AI is decision-making on top of execution.
Agents are adaptive and can reason about next steps; automation cannot.
Real-world systems are too complex for static rules, which is why agentic AI is increasingly relevant.
Understanding this distinction is crucial before designing workflows — not every task needs an agent.

Real-World Use Cases of Agentic AI

Now that we understand what agentic AI is and how it differs from traditional automation, it’s time to see how it applies in real projects.
These examples are grounded in DevOps, cloud operations, and enterprise systems — not abstract theory.

1. Cloud Incident Response

Problem: In a multi-region cloud deployment, services occasionally experience downtime or latency spikes. Manual intervention is slow and stressful, especially during off-hours.

Traditional approach:

Alerts fire to on-call engineers
Engineers diagnose using dashboards, logs, and metrics
Apply a fix (restart pod, scale resources, rollback deployment)
Verify service recovery

Challenges:

Time-consuming
Human error under pressure
Scaling issue: hundreds of services may be affected simultaneously

Agentic AI approach:

Observes all metrics, logs, and alerts in real-time
Diagnoses the likely root cause using past incident data
Chooses and executes the safest remediation (scale, restart, rollback)
Evaluates whether the service has recovered
Escalates to human only if needed

Impact:

Faster resolution times
Reduced alert fatigue for engineers
Consistent and repeatable response across regions

2. Cloud Cost Optimization

Problem: Cloud resources often sit underutilized, leading to unnecessary spend.

Traditional approach:

Engineers run reports
Identify over-provisioned resources
Manually resize or delete

Challenges:

Manual review is tedious
Risk of accidental downtime
Scaling this across hundreds of resources is difficult

Agentic AI approach:

Observes usage patterns, cost trends, and resource metrics
Identifies underutilized VMs, storage, or containers
Proposes actions or automatically applies safe changes
Verifies service performance post-change
Adjusts strategy over time

Impact:

Reduced cloud spend
Continuous optimization without manual effort
Safe, controlled execution with fallback mechanisms
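
The "observe and identify" half of this approach can be sketched as follows. The `ResourceUsage` shape, the 10% CPU threshold, and the field names are assumptions made for illustration; a real agent would read these values from the cloud provider's metrics and billing exports.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class ResourceUsage:
    name: str
    cpu_samples: List[float]   # CPU utilisation in percent
    monthly_cost: float        # in the currency of the billing export


def propose_downsizing(resources: List[ResourceUsage],
                       cpu_threshold: float = 10.0) -> List[Tuple[str, float, float]]:
    """Return (name, average CPU, monthly cost) for underused resources, costliest first."""
    flagged = [
        (r.name, round(mean(r.cpu_samples), 1), r.monthly_cost)
        for r in resources
        if r.cpu_samples and mean(r.cpu_samples) < cpu_threshold  # skip resources without data
    ]
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

The function only proposes candidates. Applying a change and verifying performance afterwards would be separate steps, which keeps the risky part behind an explicit decision.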

3. Security Monitoring and Triage

Problem: Enterprise systems generate thousands of alerts daily.
Humans cannot investigate all alerts in real-time.

Traditional approach:

Security analysts manually triage alerts
Investigate logs and correlate events
Escalate or remediate incidents

Challenges:

High alert fatigue
Risk of missing critical threats
Slow response times

Agentic AI approach:

Observes security logs, anomaly signals, and external threat intelligence
Classifies alerts based on severity
Correlates related events automatically
Executes safe remediation for routine threats
Escalates only critical incidents

Impact:

Faster threat detection and resolution
Reduced burden on analysts
Fewer false positives and missed events
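
A rough sketch of the classify, correlate, and escalate steps, assuming each alert is a dictionary with hypothetical `host` and `severity` keys. Correlation here is simply "same host", which is far cruder than what a real security tool would do.

```python
from collections import defaultdict

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def triage(alerts):
    """Group alerts by host and route each group.

    A group is escalated to a human when any alert in it is critical;
    otherwise it is queued for automated handling.
    """
    by_host = defaultdict(list)
    for alert in alerts:
        by_host[alert["host"]].append(alert)

    decisions = {}
    for host, group in by_host.items():
        worst = max(group, key=lambda a: SEVERITY_RANK[a["severity"]])["severity"]
        decisions[host] = {
            "alerts": len(group),
            "worst": worst,
            "route": "escalate" if worst == "critical" else "auto_remediate",
        }
    return decisions
```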

4. Research or Data Pipeline Automation

Problem: Researchers or data engineers often run multi-step workflows with dependencies (ETL, data validation, model training).

Traditional approach:

Predefined scripts and cron jobs
Failures require manual inspection and rerun

Challenges:

Complex dependencies
High failure recovery overhead
Inefficient use of human time

Agentic AI approach:

Observes the state of datasets, pipelines, and compute resources
Decides which steps to execute, in what order, and when
Handles failures autonomously (retry, skip, alert)
Maintains logs and adapts strategy for future runs

Impact:

Reliable pipeline execution
Reduced manual intervention
Better reproducibility and auditability
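
The retry, skip, alert behaviour can be sketched as a small runner. The step tuple layout and the outcome strings are assumptions for illustration; real orchestrators expose their own retry and failure-handling settings.

```python
def run_pipeline(steps, max_retries=2):
    """Run steps in order; retry failures, then skip optional steps or stop and alert.

    `steps` is a list of (name, callable, required) tuples.
    Returns a log of (name, outcome) pairs for auditability.
    """
    log = []
    for name, func, required in steps:
        for attempt in range(1, max_retries + 2):  # first try plus retries
            try:
                func()
                log.append((name, f"ok after {attempt} attempt(s)"))
                break
            except Exception as exc:
                last_error = exc
        else:
            # Every attempt failed.
            if required:
                log.append((name, f"alert: {last_error}"))
                return log  # stop: downstream steps depend on this one
            log.append((name, "skipped"))
    return log
```

The returned log doubles as the record that lets a later run, or a human, see what was retried and what was skipped.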

Key Takeaways From Use Cases

Agentic AI excels in dynamic, multi-step workflows.
It reduces human cognitive load, allowing engineers to focus on complex decisions.
Real-world deployments often combine existing automation with agentic decision-making — agents rarely replace tools entirely.
Success depends on goals, feedback loops, and safe execution.

These examples show that agentic AI is practical, not theoretical.
It’s already being applied to incident management, cost optimization, security, and data pipelines — exactly where dynamic decision-making adds value.

Where Agentic AI Actually Makes Sense — and Where It Doesn’t

Understanding when to use agentic AI is just as important as understanding what it is.
Not every workflow benefits from an agent, and deploying one where it isn’t needed can add complexity, cost, and risk.

Let’s break it down from a practical, DevOps/cloud perspective.

When Agentic AI Makes Sense

Agentic AI is ideal when the workflow is complex, dynamic, or multi-step, and human intervention is slowing things down.

Key criteria:

Multi-Step Workflows

Tasks that involve multiple steps or dependencies benefit from agentic reasoning.
Example: Incident response where logs, metrics, and deployments must all be evaluated before action.

Dynamic Environments

Systems that constantly change — cloud-native applications, microservices, multi-region deployments.
Example: Auto-scaling decisions across Kubernetes clusters with fluctuating workloads.

Unpredictable Edge Cases

Situations where hard-coded automation scripts fail due to unexpected conditions.
Example: A new third-party API integration causing intermittent failures — agent evaluates options instead of blindly executing a script.

High Volume / 24/7 Operations

Environments with continuous activity, where humans cannot monitor everything.
Example: Security monitoring with thousands of alerts per day — agent filters, triages, and escalates critical events.

Feedback-Driven Processes

Workflows where outcomes matter and decisions should adapt based on results.
Example: Cloud cost optimization — scaling down resources based on utilization trends, then observing impact.

When Agentic AI Does NOT Make Sense

Not all processes require agents. In fact, applying agentic AI unnecessarily can introduce risk and overhead.

Avoid using agents when:

Simple, Predictable Tasks

If a script or cron job can reliably execute a task, don’t overcomplicate.
Example: Scheduled backup of a database or routine file cleanup.

Deterministic Workflows

Where every step has a fixed, known outcome.
Example: CI/CD pipeline that builds, tests, and deploys a single service in a controlled environment.

Strict Compliance / Regulatory Constraints

Some actions must follow a strict sequence with audit requirements.
Example: Financial transactions or regulated healthcare data processing.

Low-Risk / Low-Impact Tasks

If a failure costs little and can be easily corrected, a human or simple automation may suffice.

Where Observability is Lacking

If the agent cannot reliably observe the environment or measure outcomes, it cannot make informed decisions.

Practical Tip: Hybrid Approach

Most successful deployments use a hybrid model:

Agent handles routine, repetitive, or time-critical decisions.
Humans remain in the loop for complex, strategic, or high-risk actions.

Example:

Agent: Restarts failing pods, scales clusters, optimizes costs
Human: Approves production deployments, reviews unusual security incidents, decides on architecture changes

This keeps humans in control while leveraging the speed and consistency of agents.
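
One simple way to express this split is an allowlist with deny-by-default routing. The action names below are hypothetical and mirror the example above; anything not explicitly allowed goes to a person.

```python
# Hypothetical allowlist of actions the agent may take without approval.
AUTO_APPROVED = {"restart_pod", "scale_cluster", "optimize_cost"}


def route(action: str) -> str:
    """Return 'execute' for allowlisted actions and 'request_approval' for everything else."""
    # Unknown actions are treated as high risk by default.
    return "execute" if action in AUTO_APPROVED else "request_approval"
```

Starting from an allowlist rather than a blocklist means a newly added action is gated until someone decides it is safe.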

Key Takeaways

Agentic AI is not a silver bullet — it’s a tool for the right context.
Focus on areas where automation fails due to complexity or unpredictability.
Use hybrid approaches to balance autonomy and oversight.
Misusing agentic AI can increase risk and operational overhead rather than reduce it.

Advantages and Disadvantages of Agentic AI

After understanding what agentic AI is, its core components, and where it makes sense, let’s examine the pros and cons from a real-world engineering perspective.

Advantages

Reduced Human Intervention

Agents handle routine, repetitive, and time-sensitive tasks automatically.
Example: Automatically scaling a Kubernetes cluster when load spikes, without waking an on-call engineer at 2 a.m.

Adaptability

Agents can reason about dynamic environments and adjust actions based on observations.
Example: Adjusting deployment strategies based on current system load or metrics anomalies.

Faster Response Times

Because agents monitor continuously and can act as soon as a condition is detected, remediation can start without waiting for a human to be paged and respond.
Critical in production systems where downtime directly affects revenue or user experience.

Scalable Decision-Making

One agent can monitor hundreds of services simultaneously, something impossible for a human team to do consistently.

Knowledge Retention

Agents remember past actions, successes, and failures.
Example: An agent can avoid repeating a remediation that failed last time, improving reliability.

Disadvantages & Risks

Unpredictability

Agents make decisions dynamically. Without proper guardrails, they might choose unexpected actions.
Example: Restarting a dependent service instead of the actual failing pod.

Cost

Running agentic AI, especially with large-scale monitoring and reasoning, can incur compute, storage, and API costs.
Example: Continuous evaluation of metrics across hundreds of resources in Azure or AWS.

Debugging Complexity

When an agent fails or makes a poor decision, tracing root cause can be challenging compared to static scripts.

Security Risks

Agents often require privileged access to execute tasks.
Over-broad permissions or malicious prompts (prompt injection) could lead to unauthorized actions, data leaks, or infrastructure misuse.

Requires Proper Observability

Agents depend on accurate metrics, logs, and monitoring. Without high-quality observability, decisions may be wrong or unsafe.

Balancing Advantages and Risks

The key to success is controlled deployment:

Limit agent autonomy to low-risk actions initially.
Keep humans in the loop for critical or high-impact decisions.
Log every decision for transparency and auditing.
Continuously review performance and improve rules and feedback loops.
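
"Log every decision" can be as simple as appending one structured record per action. The sketch below assumes an in-memory list as the log sink and hypothetical field names; in practice the entries would go to whatever log store the team already audits.

```python
import json
from datetime import datetime, timezone


def record_decision(log, action, reason, approved_by=None):
    """Append one JSON line describing what was decided, why, and who approved it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "reason": reason,
        "approved_by": approved_by or "agent",  # "agent" marks an autonomous decision
    }
    log.append(json.dumps(entry, sort_keys=True))
    return entry
```

Recording the reason alongside the action is what makes a later review of a poor decision tractable.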

In short: Agentic AI is powerful, but only when deployed thoughtfully.

Agentic AI is not magic.
It’s an evolution of automation, giving software the ability to make decisions toward a goal while humans focus on strategy and oversight.

From DevOps to cloud operations, security, and data pipelines, agentic AI is already transforming the way teams handle complex, dynamic environments.

By understanding its loop, core components, advantages, and risks, you can design systems that are safe, adaptive, and effective.

💬 Discussion

If you’re a DevOps or cloud engineer, think about this:

Which tasks in your workflow could an agent handle autonomously?
Where would you insist on human approval?

I’d love to hear your thoughts in the comments!

Follow @learnwithshruthi for More Agentic AI Insights

If you found this article useful, follow me for the full 30-day agentic AI blog series, where we’ll cover:

Agentic AI vs Chatbots vs AI Assistants
Building agentic systems on Azure and Kubernetes
Real-world patterns, tips, and best practices
Hands-on examples and tutorials


# A Failed Compliance Audit in Azure DevOps: Rebuilding CI/CD with Policy as Code and Security Gates

Raghavendra R — Sun, 07 Dec 2025 13:18:13 +0000

Rebuilding Azure DevOps CI/CD for Compliance
- Introduction
Core Concepts
- Compliance in Azure DevOps: Where It Lives
- Policy as Code: Three Levels
- Security Gates in Azure DevOps
- Multi-Environment, Multi-Subscription Design
Step-by-Step Guide
- 1. Map Audit Findings to Concrete Controls
- 2. Standardize CI/CD Architecture
- 3. Implement Template-Driven CI Pipelines
- 4. Embed Policy as Code for Infrastructure
- 5. Define Environments and Security Gates
- 6. Integrate Security Scanners as Gates
- 7. Observability and Auditability
- 8. Rollout Strategy Across Teams
Architecture & Flow Diagram
Best Practices
Common Pitfalls
- 1. "Templates" That Are Optional
- 2. Over-Permissive Service Connections
- 3. Scanners That Don't Fail Builds
- 4. Manual Change Approvals Outside CI/CD
- 5. Azure Policy Not Integrated with CI
- 6. Ignoring Non-Prod Environments
- 7. No Runbooks for Gate Failures
FAQ
- 1. How does this map to AWS and GCP?
- 2. How do I add compliance without slowing delivery?
- 3. How can I scale this across dozens of teams?
- 4. How do I handle legacy applications and pipelines?
- 5. How do I integrate with ITSM and change management?
- 6. What KPIs show that CI/CD compliance is working?
- 7. How do I handle multi-region or DR scenarios?
- 8. What's the role of GitHub if we already use Azure DevOps?
Conclusion
References

Introduction

A failed compliance audit on an Azure DevOps–backed delivery stack usually exposes the same issues: ad-hoc pipelines, inconsistent checks across projects, manual approvals in emails, and no traceable mapping between controls and the CI/CD implementation.

Rebuilding CI/CD in Azure DevOps with policy as code and security gates turns your pipeline into an auditable control plane:

Compliance requirements become versioned, testable artifacts.
Every build and deployment path is governed by the same rules.
Approvals, scans, and checks are enforced centrally instead of relying on tribal knowledge.

This article focuses on:

Translating compliance controls (ISO 27001, SOC 2, PCI, etc.) into Azure DevOps pipeline constructs.
Implementing policy as code across infrastructure, application, and pipeline configuration.
Designing security and compliance gates using Azure DevOps Environments, Approvals & Checks, and integrated scanners.
Rolling out these patterns across dev/qa/stage/prod at enterprise scale.

The primary cloud context is Azure (Azure DevOps + Azure platform), with brief mappings to AWS/GCP where useful.

Core Concepts

Compliance in Azure DevOps: Where It Lives

In an Azure-centric environment, compliance controls surface in four main areas:

Source control & change management
- Azure Repos or GitHub (with Azure DevOps pipelines).
- Branch policies, PR workflows, commit history.
- Required linked work items and change records.
CI/CD pipelines
- Azure Pipelines (YAML) as the automation backbone.
- Template-based pipelines shared across teams.
- Build, test, scan, deploy, and approval flows.
Infrastructure and configuration
- Infrastructure as Code (Terraform, Bicep, ARM).
- Azure Policy for runtime governance.
- Secret management in Azure Key Vault; access via Managed Identity.
Runtime environments
- AKS, App Service, Functions, Container Apps.
- VNets, subnets, NSGs, private endpoints, Application Gateway/Front Door.
- Azure Monitor, Log Analytics, Application Insights, Defender for Cloud.

A compliant architecture ensures the same controls are applied consistently at each layer, encoded as code/config rather than manual processes.

Policy as Code: Three Levels

Policy as code in Azure DevOps typically spans three levels:

Platform & Azure resource level
- Azure Policy: Deny or audit non-compliant resources (e.g., public IPs, unencrypted disks, missing tags).
- Terraform/Bicep linters & policy engines: OPA/Conftest, Checkov, Terrascan enforcing rules before apply.
- Example mappings:
  - Azure Policy → AWS Config / SCPs, GCP Organization Policies.
  - OPA/Conftest rules are cloud-agnostic and can be reused multi-cloud.
Pipeline level
- Centralized YAML templates containing required stages and jobs:
  - SAST, SCA, container scanning.
  - Infrastructure policy checks before apply.
  - Build provenance and artifact signing (where applicable).
- Restricted patterns:
  - Projects must use approved templates.
  - Limited surface for "inline" pipeline code.
Application level
- Code quality and security standards:
  - SonarQube/SonarCloud quality gates.
  - SAST tools (e.g., GitHub Advanced Security, Snyk, Fortify, etc.).
  - Dependency scanning (SCA) and container vulnerability scanning.
- Organizational policies (minimum code coverage, no critical vulns in prod).

Security Gates in Azure DevOps

Security gates implement "stop points" in CI/CD where policy must be satisfied before progressing:

Environment-based gates
- Azure DevOps Environments (e.g., dev, qa, stage, prod).
- Approvals & Checks bound to environments:
  - Manual approvers and groups (segregation of duties).
  - Business Hours checks.
  - External service checks (e.g., custom API for risk assessment).
  - Azure Monitor alerts or service health-based checks.
Quality gates in CI
- SonarQube/SonarCloud "Quality Gate must pass" as a build gate.
- Security scanners configured to fail the build on high/critical findings.
Pre-deployment and post-deployment gates
- Pre-deployment: checks before rollout (compliance scans, change record validation).
- Post-deployment: smoke tests, health checks, synthetic monitoring.

These gates are centralized and auditable: approvers, timestamps, and outcomes are recorded in Azure DevOps and/or Azure logs for evidence.

Multi-Environment, Multi-Subscription Design

For real enterprises, environments are usually split by subscription and/or management group:

mgmt → shared services (DevOps tools, monitoring, policy assignments).
nonprod → dev/qa/stage subscriptions.
prod → production subscriptions.

Azure DevOps interacts via:

Service connections using Managed Identities or service principals.
Environment-specific variables and variable groups or Key Vault references.
Region- and environment-specific policies (e.g., stricter network rules in prod).

The same pipeline definition runs across environments, but gates and policies are tuned per environment via configuration and Azure governance.

Step-by-Step Guide

1. Map Audit Findings to Concrete Controls

Extract failed controls from the audit (e.g., "no evidence that code changes are peer-reviewed").
Map each control to an Azure DevOps / Azure implementation:

Peer review → Pull request policy requiring reviewers.
Change approvals → Environment approvals & work item linkage.
Infrastructure deviations → Azure Policy assignments and IaC validation.
Secrets management → Azure Key Vault + RBAC, no secrets in pipelines.

Build a controls-to-implementation matrix (ideally in a repo):

Control ID
Description
Azure DevOps mechanism (branch policy, pipeline template, gate, etc.)
Azure platform mechanism (Azure Policy, Key Vault, RBAC, etc.)
Evidence location (logs, dashboards, reports).

This matrix drives the rest of the implementation and becomes part of audit evidence.
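
If the matrix lives in a repo, a small check can keep it honest. The sketch below is one possible shape, with field names derived from the columns listed above and a purely hypothetical control ID scheme; it flags controls that are missing a mechanism or an evidence location.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Control:
    control_id: str
    description: str
    devops_mechanism: str     # branch policy, pipeline template, gate, etc.
    azure_mechanism: str      # Azure Policy, Key Vault, RBAC, etc.
    evidence_location: str    # logs, dashboards, reports


def incomplete_controls(controls: List[Control]) -> List[str]:
    """Return the IDs of controls that have at least one empty field."""
    return [
        c.control_id
        for c in controls
        if any(not getattr(c, f.name).strip() for f in fields(c))
    ]
```

Running a check like this in the platform repo's own CI would stop a control from being merged without an evidence location.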

2. Standardize CI/CD Architecture

Create a platform repo that hosts:

Common pipeline templates (/pipelines/templates/*.yml).
Shared scripts and tooling (/scripts/*).
Policy definitions (/policies/*), e.g., OPA/Conftest rules, Checkov configs.
Documentation for teams on how to onboard.

Example minimal folder structure:

platform-pipelines/
  pipelines/
    templates/
      ci-template.yml
      cd-template.yml
      policy-checks.yml
  policies/
    opa/
    checkov/
  scripts/
    security/
    infrastructure/
  docs/
    controls-matrix.md
    onboarding-guides.md

3. Implement Template-Driven CI Pipelines

Use YAML templates to enforce common CI controls:

# /pipelines/templates/ci-template.yml
parameters:
  - name: runTests
    type: boolean
    default: true
  - name: sonarProjectKey
    type: string
  - name: sonarProjectName
    type: string

stages:
- stage: Build
  jobs:
  - job: Build
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: NodeTool@0
      inputs:
        versionSpec: '20.x'
    - script: npm ci
      displayName: Install dependencies
    - script: npm run build
      displayName: Build

    - ${{ if parameters.runTests }}:
      - script: npm test
        displayName: Run unit tests

- stage: Static_Analysis
  dependsOn: Build
  jobs:
  - job: SAST
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: NodeTool@0
      inputs:
        versionSpec: '20.x'
    - script: npm ci
      displayName: Install dependencies
    - script: npm run lint
      displayName: Lint
    - task: SonarQubePrepare@5
      inputs:
        SonarQube: 'SonarQube-Connection'
        scannerMode: 'CLI'
        configMode: 'manual'
        cliProjectKey: ${{ parameters.sonarProjectKey }}
        cliProjectName: ${{ parameters.sonarProjectName }}
    - task: SonarQubeAnalyze@5
    - task: SonarQubePublish@5
      inputs:
        pollingTimeoutSec: '300'

Project pipelines reference the template:

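# app repo: azure-pipelines.yml
trigger:
  branches:
    include:
      - main

# Declare the platform repo so the @platform-pipelines alias resolves.
resources:
  repositories:
    - repository: platform-pipelines
      type: git
      name: Platform/platform-pipelines  # assumption: <project>/<repo>; adjust to your organization
      ref: refs/heads/main

extends:
  template: pipelines/templates/ci-template.yml@platform-pipelines
  parameters:
    runTests: true
    sonarProjectKey: 'my-app-key'
    sonarProjectName: 'My Application'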

This ensures every repository:

Implements the same build + SAST structure.
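Runs the same Sonar analysis and publishes the quality gate result (to make a failed gate fail the build, add a blocking check such as the scanner property sonar.qualitygate.wait=true).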
Is easily updated by modifying the platform template once.

4. Embed Policy as Code for Infrastructure

Assume Terraform for Azure infrastructure:

# Example: Azure Policy assignment via Terraform (azurerm provider 3.x or later)
resource "azurerm_resource_group_policy_assignment" "deny_public_ip" {
  name                 = "deny-public-ip"
  resource_group_id    = azurerm_resource_group.app_rg.id
  policy_definition_id = data.azurerm_policy_definition.deny_public_ip.id
  enforce              = true

  display_name = "Deny Public IP Assignment"
  description  = "Policy to deny creation of public IP addresses"

  # The built-in definition takes the blocked resource types as a parameter.
  parameters = jsonencode({
    listOfResourceTypesNotAllowed = {
      value = ["Microsoft.Network/publicIPAddresses"]
    }
  })
}

# Using a built-in Azure Policy definition
data "azurerm_policy_definition" "deny_public_ip" {
  name = "6c112d4e-5bc7-47ae-a041-ea2d9dccd749"  # Built-in policy ID for "Not allowed resource types"
}

# Alternative: Reference by display name (less reliable)
# data "azurerm_policy_definition" "deny_public_ip" {
#   display_name = "Not allowed resource types"
# }

Add policy checks in CI before terraform apply:

# /pipelines/templates/policy-checks.yml
stages:
- stage: Policy_Checks
  jobs:
  - job: Terraform_Validate
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - script: terraform init
      displayName: Initialize Terraform
    - script: terraform validate
      displayName: Validate Terraform configuration
    - script: terraform plan -out=tfplan
      displayName: Generate Terraform plan

  - job: Policy_Scan
    dependsOn: Terraform_Validate
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - script: |
        # Checkov is not preinstalled on the hosted image
        pip install checkov
        checkov -d . --framework terraform --output cli --output junitxml --output-file-path console,results.xml
      displayName: Install and run Checkov policy scans
    - task: PublishTestResults@2
      condition: always()
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: 'results.xml'
        testRunTitle: 'Checkov Policy Scan Results'

Attach this to your infra repos:

# Requires a resources.repositories entry that defines the platform-pipelines alias.
extends:
  template: pipelines/templates/policy-checks.yml@platform-pipelines

If Checkov (or an OPA/Conftest step added to the same template) finds a policy violation, the pipeline fails, preventing non-compliant infra from being applied, irrespective of who runs it.

5. Define Environments and Security Gates

Create Azure DevOps Environments for:

dev
qa
stage
prod

For each environment:

Configure Approvals & Checks:
- dev: maybe no manual approvals, but require successful policy & security checks.
- qa/stage: manual approvers from QA/SRE; check for linked work item with "Ready for test/Release".
- prod: change-management approver group, CAB-like workflow, and external status checks.

Sample CD stage referencing environments:

# /pipelines/templates/cd-template.yml
stages:
- stage: Deploy_Dev
  dependsOn: [Build, Static_Analysis]
  jobs:
  - deployment: deploy_dev
    environment: 'dev'
    strategy:
      runOnce:
        deploy:
          steps:
          - script: ./scripts/deploy-dev.sh

- stage: Deploy_Prod
  dependsOn: Deploy_Dev
  condition: succeeded()
  jobs:
  - deployment: deploy_prod
    environment: 'prod'
    strategy:
      runOnce:
        deploy:
          steps:
          - script: ./scripts/deploy-prod.sh

Approvals & Checks are configured on the dev and prod environments in the Azure DevOps UI:

prod environment:
- Required approvers group (e.g., "Production Approvers").
- External service check calling a compliance API ("Is this release approved?").
- Business Hours check (no prod deploys outside allowed window).

Azure DevOps records:

Who approved.
When they approved.
What was deployed.

This becomes solid audit evidence.

6. Integrate Security Scanners as Gates

In the CI stage:

SAST and SCA:
- Run on every commit.
- Fail on high/critical severity issues.
Container scanning:
- Scan images before pushing to ACR.
- Fail pipeline if CVEs exceed defined thresholds.

Example snippet:

steps:
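# testType 'code' runs Snyk Code (SAST); add a second task with testType 'app' for dependency (SCA) scanning.
- task: SnykSecurityScan@1
  inputs:
    serviceConnectionEndpoint: 'Snyk-Connection'
    testType: 'code'
    severityThreshold: 'high'
    monitorWhen: 'always'
    failOnIssues: true
  displayName: Snyk Code (SAST)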

- script: |
    # Install Trivy
    sudo apt-get update && sudo apt-get install -y wget apt-transport-https gnupg lsb-release
    wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
    echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | sudo tee -a /etc/apt/sources.list.d/trivy.list
    sudo apt-get update && sudo apt-get install -y trivy

    # Scan container image
    trivy image --exit-code 1 --severity HIGH,CRITICAL --format sarif --output trivy-results.sarif $(imageName)
  displayName: Container vulnerability scan with Trivy

# SARIF is not a test-result format, so publish the report as a build artifact instead.
- task: PublishBuildArtifacts@1
  condition: always()
  inputs:
    PathtoPublish: 'trivy-results.sarif'
    ArtifactName: 'CodeAnalysisLogs'
  displayName: Publish Trivy SARIF report

In CD:

Ensure the pipeline uses only images from the internal ACR, already scanned and tagged as compliant.

7. Observability and Auditability

Wire CI/CD and runtime to observable sources:

Azure DevOps:
- Audit logs for approvals, permission changes, service connections.
- Pipeline run history, including stage results and logs.
Azure Monitor + Log Analytics:
- Resource changes (Activity Log, Resource Graph).
- Azure Policy compliance dashboard.
- Defender for Cloud (formerly Security Center) recommendations.

Create dashboards showing:

% of compliant resources per subscription.
Number of deployments per environment and their success/failure rates.
Mean time to remediate non-compliant resources.
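
The three dashboard figures above reduce to simple aggregations once the raw records are exported. The sketch below assumes hypothetical record shapes (dictionaries with `compliant`, `environment`, and `succeeded` keys, and pairs of datetimes for findings); the real data would come from Azure Policy compliance exports and pipeline run history.

```python
from datetime import datetime


def percent_compliant(resources):
    """`resources` is a list of dicts with a boolean 'compliant' key."""
    if not resources:
        return 0.0
    return round(100 * sum(r["compliant"] for r in resources) / len(resources), 1)


def success_rate_by_environment(deployments):
    """`deployments` is a list of dicts with 'environment' and 'succeeded' keys."""
    totals = {}
    for d in deployments:
        ok, count = totals.get(d["environment"], (0, 0))
        totals[d["environment"]] = (ok + int(d["succeeded"]), count + 1)
    return {env: round(100 * ok / count, 1) for env, (ok, count) in totals.items()}


def mean_hours_to_remediate(findings):
    """`findings` is a list of (detected, remediated) datetime pairs."""
    if not findings:
        return 0.0
    hours = [(fixed - found).total_seconds() / 3600 for found, fixed in findings]
    return round(sum(hours) / len(hours), 1)
```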

8. Rollout Strategy Across Teams

Start with platform and security-critical services.
Mandate platform templates for any new project.
Migrate existing pipelines in phases:
- Phase 1: Add security scans and approvals.
- Phase 2: Move to shared templates.
- Phase 3: Decommission legacy build/release pipelines.

Use Azure DevOps Project-level governance:

Restrict pipeline creation to templates.
Limit who can modify service connections and environment checks.
Enforce minimal RBAC for service connections (least privilege).

Architecture & Flow Diagram

Best Practices

Centralize pipeline logic
- Use YAML templates stored in a dedicated platform repo.
- Avoid per-project custom scripts unless strictly necessary.
Use Azure DevOps Environments for deployments
- Treat environments as security boundaries with their own approvals/checks.
- Configure gates per environment rather than embedding manual approvals in YAML.
Enforce branch policies
- Require PRs to main/release branches.
- Require successful CI and quality gates before merging.
- Require at least two reviewers for critical repos.
Integrate policy as code early
- Validate IaC (Terraform/Bicep) with OPA/Checkov before apply.
- Use Azure Policy to enforce guardrails at runtime (e.g., deny public internet exposure).
Lock down service connections
- Use Managed Identities or tightly scoped service principals.
- Restrict who can create/edit service connections.
- Audit changes regularly.
Automate secret management
- Store secrets in Azure Key Vault.
- Use Key Vault references and Managed Identity instead of pipeline variables.
Treat scanners as gates, not optional tools
- Make SAST, SCA, and container scanning blocking steps with defined thresholds.
- Configure alerting on repeated failures.
Evidence-first mindset
- For every control, define:
  - Implementation mechanism.
  - Evidence location and retention time.
- Automate reports/dashboards to export evidence for auditors.
Segregation of duties
- Separate roles:
  - Platform team owns templates and environments.
  - App teams own business logic and configuration values.
  - Security team owns policy definitions and thresholds.
Version everything
- Version policies, templates, and gating logic.
- Use tags and releases in the platform repo to track "policy versions" over time.

Common Pitfalls

1. "Templates" That Are Optional

Mistake: providing recommended templates but allowing teams to bypass them.
Impact: fragmented compliance posture; some apps fully gated, others wide open.
Detection:
- Scan repositories for azure-pipelines.yml not referencing the platform repo.
Fix:
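- Enforce the rule technically, for example with a Required template check on protected resources (environments, service connections), so pipelines must extend approved templates.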
- Restrict who can create/edit pipelines.
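
The detection step can be approximated with a text scan over checked-out repositories. This is a rough sketch, assuming the repo alias is `platform-pipelines` as in the earlier examples and that each repo sits in its own directory under a common root; it matches text rather than parsing YAML, so treat a hit or miss as a prompt for review, not proof.

```python
import re
from pathlib import Path

# Matches lines such as "template: some/path.yml@platform-pipelines" (optionally a list item).
TEMPLATE_REF = re.compile(
    r"""^[ \t]*-?[ \t]*template:[ \t]*['"]?\S+@platform-pipelines(?![\w-])""",
    re.MULTILINE,
)


def uses_platform_template(pipeline_text: str) -> bool:
    """True when the YAML extends from, or includes, a platform-pipelines template."""
    return bool(TEMPLATE_REF.search(pipeline_text))


def non_compliant_repos(root: Path):
    """Return repo directory names under `root` whose azure-pipelines.yml skips the platform repo."""
    return sorted(
        path.parent.name
        for path in root.glob("*/azure-pipelines.yml")
        if not uses_platform_template(path.read_text(encoding="utf-8"))
    )
```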

2. Over-Permissive Service Connections

Mistake: one "god" service principal with Owner on all subscriptions.
Impact: audit findings, lateral movement risk, potential blast radius of pipeline compromise.
Detection:
- Review Azure DevOps service connection permissions and associated Azure RBAC roles.
Fix:
- Create environment-specific identities with least privilege.
- Use Management Groups and RBAC to scope access tightly.

3. Scanners That Don't Fail Builds

Mistake: running SAST/SCA scans, but ignoring results or only warning.
Impact: critical vulnerabilities shipped to production.
Detection:
- Check for steps where scanners run but no failure condition is configured.
Fix:
- Configure exit codes or fail-on-severity thresholds.
- Treat security findings as blocking gates, not optional reports.
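
Where a scanner has no built-in threshold flag, the same gate can be built from its report. The sketch below assumes findings have already been parsed into dictionaries with a hypothetical `severity` key; the parsing itself depends on the scanner's output format.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking_findings(findings, threshold="high"):
    """Return the findings whose severity is at or above the threshold."""
    minimum = SEVERITY_ORDER.index(threshold)
    return [f for f in findings if SEVERITY_ORDER.index(f["severity"].lower()) >= minimum]


def gate(findings, threshold="high"):
    """Exit code for the pipeline step: 0 lets the build continue, 1 fails it."""
    return 1 if blocking_findings(findings, threshold) else 0
```

A pipeline script step would pass the result to `sys.exit`, so that a non-zero code fails the job instead of producing a warning nobody reads.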

4. Manual Change Approvals Outside CI/CD

Mistake: approvals done in emails or ticket comments without integration to pipelines.
Impact: no traceable linkage between change and deployment; audit evidence is weak.
Detection:
- Compare prod deployments with change records; look for missing linkage.
Fix:
- Require linked work items in PRs and deployments.
- Use environment approvals and external status checks that validate change IDs.
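
The comparison described under Detection can be sketched as a reconciliation between two exports. The record fields (`run_id`, `environment`, `change_id`) are assumptions for illustration; the real values would come from pipeline run history and the change-management system.

```python
def deployments_missing_change(deployments, approved_change_ids):
    """Return run IDs of prod deployments with no change ID, or one that is not approved."""
    approved = set(approved_change_ids)
    return [
        d["run_id"]
        for d in deployments
        if d["environment"] == "prod" and d.get("change_id") not in approved
    ]
```

An empty result over an audit period is itself useful evidence; a non-empty one is the list of deployments to investigate.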

5. Azure Policy Not Integrated with CI

Mistake: relying solely on Azure Policy to block non-compliant resources post-deployment.
Impact: pipelines fail late; engineers frustrated by mysterious denies.
Detection:
- Look at Azure Policy deny events; if most are triggered by pipeline deployments, you have a shift-left gap.
Fix:
- Mirror Azure Policy rules into IaC scanners (Checkov/OPA).
- Fail early in CI, before apply or deployment.

6. Ignoring Non-Prod Environments

Mistake: strict governance only in prod; dev/qa are "wild west".
Impact: drift, shadow IT, data leaks (dev often holds real data), inconsistent testing.
Detection:
- Compare policy compliance and network rules across non-prod vs prod.
Fix:
- Apply similar guardrails in non-prod, with slightly relaxed thresholds if needed.
- Use same CI/CD architecture and policy bundles across all environments.

7. No Runbooks for Gate Failures

Mistake: gates fail but teams don't know what to do.
Impact: slow incident response, friction, gate bypasses.
Detection:
- Survey teams; track MTTR for gate-related failures.
Fix:
- Publish runbooks for each gate:
- Why it fails.
- Where to view details.
- How to remediate or escalate.

FAQ

1. How does this map to AWS and GCP?

AWS:
- Azure DevOps pipelines ↔ CodePipeline/CodeBuild or GitHub Actions.
- Azure Policy ↔ AWS Config, SCPs.
- Azure Monitor ↔ CloudWatch/CloudTrail.
GCP:
- Azure DevOps pipelines ↔ Cloud Build/Cloud Deploy or GitHub Actions.
- Azure Policy ↔ Organization Policies.
- Azure Monitor ↔ Cloud Logging/Monitoring.

The pattern is the same: centralized templates, policy as code, and environment-level gates.

2. How do I add compliance without slowing delivery?

Make checks fast and automated in dev/qa.
Reserve manual approvals only for high-risk operations (e.g., prod deploys).
Shift heavy scanning earlier in the pipeline to catch issues before the approval step.
Continuously tune thresholds based on data (false positives, frequency of issues).

3. How can I scale this across dozens of teams?

Create a platform team that owns:
- Templates, policies, and gates.
- Documentation and onboarding.
Make templates easy to adopt:
- Good defaults, minimal required parameters.
- Clear examples and starter pipelines.

4. How do I handle legacy applications and pipelines?

Start by wrapping legacy pipelines:
- Add scanners and approvals around them.
Gradually migrate:
- Move to YAML pipelines.
- Move to shared templates.
Keep a sunset plan and timeline for legacy release pipelines.

5. How do I integrate with ITSM and change management?

Require a change record ID tied to:
- Pull requests.
- Deployment stages.
Use environment external checks to validate change state (e.g., "Approved").
Store change IDs as variables in pipeline runs for traceability.

6. What KPIs show that CI/CD compliance is working?

Deployment frequency per environment.
Change failure rate and MTTR.
Policy compliance percentage across resources.
Number of pipeline runs failing due to policy/security, and their remediation times.
Reduction in audit findings over time.
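
Two of these KPIs are simple to compute once deployments and incidents are exported as records. The sketch below assumes hypothetical record shapes (`failed`, `started`, `restored`); where the data comes from is left open.

```python
from datetime import datetime, timedelta


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure in their environment."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return failed / len(deployments)


def mean_time_to_restore(incidents):
    """Average of (restored - started) across incidents, as a timedelta."""
    if not incidents:
        return timedelta(0)
    total = sum((i["restored"] - i["started"] for i in incidents), timedelta(0))
    return total / len(incidents)


# Hypothetical sample data.
deployments = [{"failed": False}, {"failed": True}, {"failed": False}, {"failed": False}]
incidents = [
    {"started": datetime(2025, 1, 1, 10, 0), "restored": datetime(2025, 1, 1, 10, 30)},
    {"started": datetime(2025, 1, 2, 9, 0), "restored": datetime(2025, 1, 2, 10, 30)},
]
print(change_failure_rate(deployments), mean_time_to_restore(incidents))
```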

7. How do I handle multi-region or DR scenarios?

Use the same templates and policies per region.
Environment naming can encode region: prod-euw, prod-use.
Use Azure Traffic Manager/Front Door and global routing policies.
Ensure compliance controls are applied in both primary and DR regions; treat DR as production from a compliance standpoint.

8. What's the role of GitHub if we already use Azure DevOps?

Many orgs use:
- GitHub for source control, PRs, and security (e.g., Dependabot, GHAS).
- Azure DevOps pipelines for CI/CD into Azure.
The same pattern applies:
- Policy as code and gates in Azure Pipelines.
- Branch policies and code scanning in GitHub.

Conclusion

A failed compliance audit is usually a symptom of invisible, inconsistent pipeline behavior. Rebuilding Azure DevOps CI/CD with policy as code and security gates converts scattered practices into a standardized, auditable system:

Controls live in code and templates, not in ad-hoc wikis.
Every deployment path is governed by the same rules.
Evidence for auditors is generated automatically via logs, dashboards, and approvals.

Concrete next steps:

Build a controls-to-implementation matrix and align on ownership.
Stand up a platform repo with templates, policies, and tooling.
Introduce environment-based gates and scanners as blocking steps.
Gradually migrate teams to the new pattern, starting with critical systems.

Bookmark this guide, share it with your platform/DevSecOps team, and post your own pipeline templates and policy bundles in the comments so the community can learn from real-world configurations.

When Azure Front Door Won't Fail Over: Lessons from a Real Multi-Region DR Drill

Raghavendra R — Sun, 07 Dec 2025 13:12:22 +0000

Azure Front Door didn't fail over during a real multi-region DR drill. Here's what went wrong, how we fixed it, and how to design reliable failover.

The Story / Background
- The architecture we thought we had
- The drill
- What actually happened
Core Concepts: How Azure Front Door Failover Really Works
- Origin groups, priorities, and routing
- Health probes and what "healthy" really means
- Active-active vs active-passive in DR context
- Data tier is not Front Door's job
- Observability for failover
Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR
- 1. Define RTO/RPO and failure modes
- 2. Design origin groups and health probe strategy
- 3. Implement with Terraform (example)
- 4. Build DR-aware pipelines and configuration management
- 5. Implement synthetic tests and dashboards
- 6. Run regular DR drills and chaos tests
Architecture Diagram
Best Practices for Azure Front Door Multi-Region DR
Common Pitfalls (and How to Avoid Them)
FAQ
Conclusion

A few quarters ago we ran what we thought would be a routine multi-region DR game day on Azure. The plan was simple: simulate a primary region failure, watch Azure Front Door detect the issue, fail over to the secondary region, and go for coffee feeling smug.

Instead, Front Door stared at our "dead" region and kept happily sending it traffic. Users got timeouts. Dashboards lit up. Our DR runbooks suddenly looked very theoretical. I'll walk through what actually happened, how we debugged it, and the patterns I use now whenever I put Azure Front Door in front of multi-region workloads.

The Story / Background

The architecture we thought we had

This was a fairly typical enterprise setup:

Front door / CDN: Azure Front Door Standard/Premium with WAF
Two Azure regions:
- Region A (primary) – AKS + internal Application Gateway, Azure SQL with geo-replica
- Region B (secondary) – warm standby AKS + App Gateway, Azure SQL geo-replica
Routing mode: Active-passive (priority routing) in Front Door
Health probes: Configured at the origin group level to hit /health on each region's App Gateway
Infra-as-Code: Terraform for Front Door, AKS, App Gateway, SQL, and plumbing
Observability: Azure Monitor, Log Analytics, Application Insights, plus synthetic checks from multiple locations

On paper, this ticked all the boxes: multi-region, DR runbooks, IaC, WAF in front, and tests.

The drill

The DR playbook was:

Simulate a partial outage in Region A.
Observe Front Door marking the primary origin unhealthy.
Confirm automatic failover to Region B.
Run smoke tests and declare the drill successful.

Simulation method: we applied a network ACL on the primary App Gateway subnet to effectively blackhole traffic from Front Door, mimicking a critical failure in the app tier.

What actually happened

Front Door did not immediately fail over.
Users got intermittent timeouts and 5xxs, but traffic kept trying Region A for long enough to trigger a production-level incident if this had been real.
Our synthetic checks (which hit the Front Door endpoint) kept reporting "green" for several minutes.
Logs seemed contradictory: App Gateway showed traffic drops; Front Door metrics looked almost normal.

It took a painful hour-plus of log diving and config reviews to realize:

Our health probe path /health was still responding 200 OK from a separate "status" service that hadn't been affected by the simulated failure.
The probe interval and sample size made failover slower than our target RTO.
Some internal services were bypassing Front Door and talking directly to Region A's private endpoints, so even if Front Door had failed over, we still had partial breakage.

The short version: the app died, but the health probes didn't. And Front Door did exactly what we told it to do, not what we thought we configured.

Core Concepts: How Azure Front Door Failover Really Works

Let's unpack what matters for Azure Front Door in a multi-region DR setup.

Origin groups, priorities, and routing

In Azure Front Door Standard/Premium:

You define origin groups (backend pools).
Within a group, each origin (Region A, Region B) can have:
- A priority (for active-passive)
- A weight (for active-active / traffic split)
Front Door sends traffic to the healthy origin with the lowest priority value (priority 1 is preferred over priority 2).
If that origin becomes unhealthy, it will fail over to the next priority.
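
To make the selection rule concrete, here is a toy model of it in Python. This is my own simplification, not Front Door's implementation: the `Origin` class and `pick_origin` function are hypothetical, and the case where no origin is healthy simply returns `None` rather than modelling what Front Door does.

```python
import random
from dataclasses import dataclass


@dataclass
class Origin:
    name: str
    priority: int   # lower value = preferred
    weight: int     # only matters among origins sharing a priority
    healthy: bool


def pick_origin(origins, rng=random):
    """Pick among healthy origins with the lowest priority value, by weight."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return None  # not modelled here
    best = min(o.priority for o in healthy)
    candidates = [o for o in healthy if o.priority == best]
    return rng.choices(candidates, weights=[o.weight for o in candidates], k=1)[0]


region_a = Origin("app-region-a", priority=1, weight=1000, healthy=True)
region_b = Origin("app-region-b", priority=2, weight=1000, healthy=True)
print(pick_origin([region_a, region_b]).name)

region_a.healthy = False  # simulate the primary being marked unhealthy
print(pick_origin([region_a, region_b]).name)
```

Notice that Region B only ever receives traffic once Region A's `healthy` flag flips, which is why everything hinges on how that flag is computed.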

The word "healthy" hides a lot of detail.

Health probes and what "healthy" really means

Health probes are where most DR drills go to die:

Probes are configured per origin group with:
- Protocol & port (HTTP/HTTPS, 80/443, etc.)
- Path (e.g., /healthz, /live, /ready)
- Interval & sample size
Front Door considers an origin healthy if enough probes in the configured sample window return 200 OK; per Microsoft's health probe documentation, any other status code or a missing response counts as a failure.
It considers an origin unhealthy after enough failures/timeouts in that window.
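
A small simulation helps build intuition for the sample window. This is a simplified model with a single prober, and the `ProbeWindow` class is hypothetical; it counts only 200 OK as a success, which is how Microsoft's health probe documentation describes the evaluation.

```python
from collections import deque


class ProbeWindow:
    """Tracks the last `sample_size` probe results for one origin."""

    def __init__(self, sample_size=4, successful_samples_required=3):
        self.required = successful_samples_required
        # Start optimistic: assume the origin was healthy before the window began.
        self.samples = deque([True] * sample_size, maxlen=sample_size)

    def record(self, status_code):
        """Record one probe result; None stands for a timeout."""
        self.samples.append(status_code == 200)

    @property
    def healthy(self):
        return sum(self.samples) >= self.required


window = ProbeWindow(sample_size=4, successful_samples_required=3)
for status in [200, None, 503]:
    window.record(status)
    print(status, "->", "healthy" if window.healthy else "unhealthy")
```

With a 4-sample window that needs 3 successes, one failed probe is tolerated and the second one flips the origin to unhealthy.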

Key gotchas:

If your probe hits a different component than your critical path (e.g., a static health page, a separate sidecar), you'll see green while users are screaming.
If the probe is too forgiving (long intervals, large sample size), failover is slower than your RTO.
If the probe path is behind aggressive caching or a CDN rule, Front Door might be probing a cached thing, not your real app.

Active-active vs active-passive in DR context

Active-passive (priority routing)
- Simpler mental model: Region A is primary, Region B is standby.
- Good when your data tier or regulatory constraints make multi-master tricky.
Active-active (latency / weighted)
- Better utilization and resilience, but more complex for stateful workloads.
- Requires careful handling for session affinity, data consistency, and rollouts.

Front Door supports both via routing rules and origin group configuration, but DR behavior and testing strategy differ.

Data tier is not Front Door's job

Front Door only handles HTTP(S) routing. Your data layer is your responsibility:

Azure SQL with active geo-replication or auto-failover groups
Cosmos DB with multi-region writes
Redis with geo-replication or region-local caches
Storage accounts with RA-GRS or dual-write patterns

If your data tier can't fail over fast enough, Front Door can swap regions all day and users will still see errors or stale data.

Observability for failover

For real DR:

Azure Monitor & Log Analytics for Front Door metrics and logs
Application Insights for dependency failures, response times, distributed tracing
Synthetic tests (multi-region) that hit the Front Door endpoint with app-level expectations
End-to-end dashboards showing:
- Front Door health vs backend health
- Per-region error rates
- Failover events and timings

Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR

1. Define RTO/RPO and failure modes

Before YAML and Terraform, write down:

RTO – how fast must failover complete?
RPO – how much data loss can you tolerate?
Failure modes you care about:
- Region outage
- App tier outage
- Partial dependency outage (e.g., DB or cache)
- Front Door misconfig / WAF block

Agree this with product, business, and security. DR that only works for "region disappeared" but not "DB is slow" is half a solution.

2. Design origin groups and health probe strategy

For an active-passive setup:

Single origin group with two origins: app-region-a, app-region-b.
Use priority: Region A = 1, Region B = 2.
Configure probes to hit a realistic but cheap path, e.g. /readyz that:
- Checks app's critical dependencies (DB, cache, queue) at lightweight level.
- Returns non-2xx when something essential is broken.
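
Here is a minimal sketch of such an endpoint using only the Python standard library. The dependency checks are placeholders (hypothetical lambdas that always pass); in a real service each one would be a cheap call with a short timeout against the DB, cache, or queue.

```python
import json
from http.server import BaseHTTPRequestHandler


def readiness(checks):
    """Run each dependency check; return (http_status, per-dependency results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as a failed dependency
    status = 200 if all(results.values()) else 503
    return status, results


# Placeholder checks; replace with real lightweight dependency calls.
CHECKS = {
    "db": lambda: True,
    "cache": lambda: True,
}


class ReadyzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        status, results = readiness(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "no-store")  # never let probes hit a cache
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

The handler can be served with `http.server.HTTPServer(("", 8080), ReadyzHandler).serve_forever()` for local experiments; the important part is that a single broken dependency turns the response into a 503.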

3. Implement with Terraform (example)

Here's a simplified Terraform snippet for Azure Front Door Standard/Premium with two origins and a health probe tuned for DR:

# Resource Group
resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "East US"
}

# Azure Front Door Profile
resource "azurerm_cdn_frontdoor_profile" "prod" {
  name                = "fd-prod-profile"
  resource_group_name = azurerm_resource_group.network.name
  sku_name            = "Standard_AzureFrontDoor"

  tags = {
    environment = "production"
    purpose     = "multi-region-dr"
  }
}

# Front Door Endpoint
resource "azurerm_cdn_frontdoor_endpoint" "prod" {
  name                     = "fd-prod-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id

  tags = {
    environment = "production"
  }
}

# Origin Group with Health Probes
resource "azurerm_cdn_frontdoor_origin_group" "app" {
  name                     = "og-app-multiregion"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id

  session_affinity_enabled = false

  health_probe {
    interval_in_seconds = 15
    path                = "/readyz"
    protocol            = "Https"
    request_type        = "GET"
  }

  load_balancing {
    additional_latency_in_milliseconds = 0
    successful_samples_required        = 3
    sample_size                        = 4
  }
}

# Primary Origin (Region A)
resource "azurerm_cdn_frontdoor_origin" "app_region_a" {
  name                           = "app-region-a"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-eastus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 1
  weight                         = 1000
  enabled                        = true

  certificate_name_check_enabled = true
}

# Secondary Origin (Region B)
resource "azurerm_cdn_frontdoor_origin" "app_region_b" {
  name                           = "app-region-b"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-westus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 2
  weight                         = 1000
  enabled                        = true

  certificate_name_check_enabled = true
}

# Route to map requests to origin group
resource "azurerm_cdn_frontdoor_route" "app_route" {
  name                          = "app-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.prod.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.app.id

  # The provider requires the route to list the origins it serves.
  cdn_frontdoor_origin_ids = [
    azurerm_cdn_frontdoor_origin.app_region_a.id,
    azurerm_cdn_frontdoor_origin.app_region_b.id,
  ]

  patterns_to_match      = ["/*"]
  supported_protocols    = ["Http", "Https"]
  https_redirect_enabled = true

  forwarding_protocol    = "HttpsOnly"
  link_to_default_domain = true
}

4. Build DR-aware pipelines and configuration management

Treat Front Door config as code (Terraform/Bicep).
Protect it with:
- Pull requests and mandatory reviews.
- Policy checks (e.g., checks that every origin has a probe).
- Automated validation in a non-prod "chaos" environment.
Build pipelines that can:
- Temporarily disable an origin (simulated outage).
- Flip priorities if you need a manual failover.
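
One way to sketch the "every origin has a probe" policy check is to inspect the JSON form of the Terraform plan (`terraform show -json`). Probes are configured on the origin group, so that is what the hypothetical function below looks at; it assumes nested blocks such as `health_probe` appear as lists in the plan's `after` values.

```python
def origin_groups_without_probe(plan):
    """Return addresses of Front Door origin groups planned without a health_probe block."""
    missing = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "azurerm_cdn_frontdoor_origin_group":
            continue
        after = (rc.get("change") or {}).get("after") or {}
        if not after.get("health_probe"):  # absent, None, or empty list
            missing.append(rc["address"])
    return missing


# Hypothetical, heavily trimmed plan document.
sample_plan = {
    "resource_changes": [
        {
            "address": "azurerm_cdn_frontdoor_origin_group.app",
            "type": "azurerm_cdn_frontdoor_origin_group",
            "change": {"after": {"health_probe": [{"path": "/readyz"}]}},
        },
        {
            "address": "azurerm_cdn_frontdoor_origin_group.legacy",
            "type": "azurerm_cdn_frontdoor_origin_group",
            "change": {"after": {"health_probe": []}},
        },
    ]
}
print(origin_groups_without_probe(sample_plan))
```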

Example Azure CLI snippet to temporarily disable Region A origin:

#!/bin/bash

# Disable primary origin for DR testing
az afd origin update \
  --resource-group rg-network-prod \
  --profile-name fd-prod-profile \
  --origin-group-name og-app-multiregion \
  --origin-name app-region-a \
  --enabled-state Disabled

echo "Origin app-region-a has been disabled. Traffic should failover to app-region-b."

# Monitor failover progress
echo "Monitoring Front Door metrics for 5 minutes..."
sleep 300

# Re-enable origin after test
read -p "Re-enable primary origin? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    az afd origin update \
      --resource-group rg-network-prod \
      --profile-name fd-prod-profile \
      --origin-group-name og-app-multiregion \
      --origin-name app-region-a \
      --enabled-state Enabled
    echo "Origin app-region-a has been re-enabled."
fi

Use this in non-prod to safely observe Front Door's behavior.

5. Implement synthetic tests and dashboards

Create synthetic tests that:
- Hit https://app.contoso.com/healthcheck-end-to-end
- Validate response code, body, and latency
- Run from multiple Azure regions (or external providers)
Build dashboards that show, per region:
- Front Door origin health state
- App response times
- Error rates and timeouts

Ensure your on-call runbook includes how to read these graphs during a DR event.
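
A bare-bones synthetic check might look like the sketch below. The expected body text and latency budget are assumptions to adjust per application; `evaluate` is kept separate from the network call so the pass/fail rules can be tested on their own.

```python
import time
import urllib.error
import urllib.request


def evaluate(status, body, latency_s, expect_text="ok", max_latency_s=2.0):
    """Return a list of failure reasons; an empty list means the check passed."""
    failures = []
    if status != 200:
        failures.append(f"unexpected status {status}")
    if expect_text not in body:
        failures.append("expected text missing from body")
    if latency_s > max_latency_s:
        failures.append(f"slow response: {latency_s:.2f}s")
    return failures


def run_check(url, timeout_s=5.0):
    """Fetch the URL once and evaluate it; network errors count as failures."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status, body = resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        status, body = exc.code, ""  # the server answered, but with an error status
    except OSError as exc:           # DNS failure, refused connection, timeout
        return [f"request failed: {exc}"]
    return evaluate(status, body, time.monotonic() - start)
```

Calling `run_check("https://app.contoso.com/healthcheck-end-to-end")` from several regions on a schedule, and alerting on any non-empty result, gives you a user's-eye view that the Front Door probes alone cannot.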

6. Run regular DR drills and chaos tests

Treat DR like CI:

Schedule recurring game days (quarterly is a good start).
Test different failure modes: origin disabled, DB unavailable, cache down, WAF rule gone wild.
Time how long:
- Front Door takes to mark the origin unhealthy.
- Users experience degraded performance.
- The team takes to declare failover complete.

Capture and track those as SLOs for DR.
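
If the drill timeline is recorded as timestamps, the three durations fall out of a few subtractions. The event names below are hypothetical labels for the moments listed above.

```python
from datetime import datetime


def drill_timings(events):
    """Compute durations in seconds from failure injection to each milestone."""
    t = {name: datetime.fromisoformat(ts) for name, ts in events.items()}
    start = t["failure_injected"]
    return {
        "detection_s": (t["origin_unhealthy"] - start).total_seconds(),
        "user_impact_s": (t["errors_cleared"] - start).total_seconds(),
        "declared_s": (t["failover_declared"] - start).total_seconds(),
    }


# Hypothetical timeline from one game day.
events = {
    "failure_injected": "2025-12-01T10:00:00",
    "origin_unhealthy": "2025-12-01T10:00:45",
    "errors_cleared": "2025-12-01T10:01:30",
    "failover_declared": "2025-12-01T10:04:00",
}
print(drill_timings(events))
```

Keeping these numbers from drill to drill makes regressions visible, for example when a probe change quietly doubles detection time.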

Architecture Diagram

The diagram below illustrates the multi-region Azure Front Door DR architecture discussed in this post:

Key Components:

Azure Front Door acts as the global load balancer with WAF protection
Priority-based routing with Region A as primary (Priority 1) and Region B as secondary (Priority 2)
Health probes monitor /readyz endpoints to determine origin health
Geo-replicated Azure SQL ensures data availability across regions
Azure Monitor provides comprehensive observability across all components

Traffic Flow:

Normal Operation: User requests → Front Door → Region A (Primary) → Application Gateway → AKS → Azure SQL Primary
During Failover: Health probes fail on Region A → Front Door routes new requests to Region B (Secondary) → Application Gateway → AKS → Azure SQL Geo-Replica
Monitoring: All components send telemetry to Azure Monitor and Application Insights for real-time observability

Best Practices for Azure Front Door Multi-Region DR

Health checks must reflect real risk
- Probe something that depends on your critical services (DB, cache, queue) but is cheap to execute.
Use explicit priorities for active-passive
- Don't rely on latency routing if your DR strategy is "primary then fail over".
Align probe configuration with RTO
- Shorter intervals and smaller sample sizes mean faster failover, at the cost of more sensitivity to transient blips.
Align internal and external paths
- Ensure internal clients also route via Front Door (or a consistent DR mechanism); otherwise they'll keep hitting a dead region.
Keep origin host headers consistent
- Use a single app host name to simplify config, TLS, and debugging.
Tag everything
- Use tags for env, region, dr-role, owner, criticality. Helps a lot in DR reviews and cost tracking.
Secure by default
- Use WAF, private origins (Private Link / internal App Gateway), and managed identities.
Centralize observability
- One place where SRE/DevOps can see Front Door + app + DB health across regions.
Automate DR verification
- After every significant infrastructure or Front Door change, run automated DR checks in lower environments.

Common Pitfalls (and How to Avoid Them)

1. Health probes hitting the wrong thing

Problem: Probes target a static /health that doesn't reflect real dependencies.

Impact: Front Door sees green while the app is actually broken, delaying failover or preventing it entirely.

Fix:

Implement /readyz or /healthz-deep that checks key dependencies.
Make sure it returns non-2xx when critical components are broken.

2. Probes behind caching or CDN rules

Problem: Health probe requests get cached or served by a rule path that hides backend errors.

Impact: Probes never see failures; Front Door won't fail over.

Fix:

Exclude health probe paths from caching and rewrites.
Validate with logs that probes hit the actual app.

3. Overly large sample sizes and long intervals

Problem: Probe interval = 60s, sample size = 16, successful samples required = 15.

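Impact: With a 60-second interval, every failed sample the window needs costs a full minute, so detection takes minutes rather than seconds, and an origin that recovers needs 15 good probes out of the last 16 before Front Door trusts it again.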

Fix:

Tune probe interval and samples to align with your RTO.
In many enterprise setups, something like 15–30s intervals and small sample windows (e.g., 3 out of 4) is a better starting point.
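
For a back-of-the-envelope check against your RTO, the arithmetic can be sketched as follows. It assumes a single prober and a window that starts fully healthy; Front Door actually probes from many edge locations and probe timeouts add delay, so treat the result as a rough estimate and confirm it in a drill.

```python
def failures_to_unhealthy(sample_size, successful_samples_required):
    """Consecutive failed probes needed to push a fully healthy window below the threshold."""
    return sample_size - successful_samples_required + 1


def estimated_detection_seconds(interval_s, sample_size, successful_samples_required):
    """Rough detection time: failures needed multiplied by the probe interval."""
    return interval_s * failures_to_unhealthy(sample_size, successful_samples_required)


# 15s interval with 3 of 4 required: two failed probes, roughly 30 seconds.
print(estimated_detection_seconds(15, 4, 3))
# 30s interval with 2 of 4 required: three failed probes, roughly 90 seconds.
print(estimated_detection_seconds(30, 4, 2))
```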

4. Internal traffic bypassing Front Door

Problem: Internal services talk directly to App Gateway or App Service in Region A.

Impact: External users may fail over via Front Door, but internal APIs and jobs still rely on the failed region.

Fix:

Use Front Door (or an equivalent internal traffic manager) as the standard entry point for inter-service communication where DR matters.
Or implement separate internal traffic management with the same multi-region logic.

5. No DR for the data tier

Problem: App tier is multi-region, but SQL or Redis is single-region.

Impact: Failover appears successful at the HTTP layer, but the secondary region has no usable data.

Fix:

Plan data DR first: geo-replication, multi-region writes, failover groups.
Wire app config (connection strings, secrets) to automatically use the correct endpoint after failover.

6. DR tests only in staging

Problem: DR game days happen in lower environments that don't mirror prod topology, traffic patterns, or data sensitivity.

Impact: False confidence. Things that worked in staging break in production.

Fix:

Run carefully scoped DR drills in production: limited time windows, pre-announced, with a rollback plan.
Start small (e.g., partial traffic) and grow once you've built muscle.

7. No clear runbook for Front Door changes

Problem: During an incident, engineers manually poke around in the Azure Portal, toggling origins and routing rules.

Impact: Slow response, new mistakes, hard to audit.

Fix:

Document and automate incident playbooks:
- "Disable primary origin"
- "Force traffic to Region B"
- "Roll back to normal state"
Implement them as scripts or pipeline tasks, not "click here, then here".

FAQ

1. Azure Front Door vs Traffic Manager vs DNS for DR?

Front Door: Layer 7 routing, WAF, caching, modern Standard/Premium features; ideal for web/API DR.
Traffic Manager: DNS-based routing, good for non-HTTP workloads or hybrid scenarios.
DNS only: Very coarse and slow control. You generally layer Front Door or Traffic Manager on top of DNS rather than relying on DNS alone.

For most modern web workloads, use Front Door as the primary DR switch and DNS as a coarse backup.

2. How do I test failover safely in production?

Start by failing a small percentage of traffic (e.g., use weighted routing in a subset environment).
Use short, well-announced windows.
Have an automated rollback (re-enable origin, revert routing).
Observe impact in real time on error budgets and SLO dashboards.

3. How should I choose health probe paths?

Use a dedicated endpoint like /readyz or /health-deep.
It should check critical dependencies in a lightweight way.
Return non-2xx when the app is not fit to serve traffic.
Exclude it from caching and WAF rules that could mask problems.

4. What's a reasonable failover time with Front Door?

It depends on your probe configuration, but many teams target:

Detection: 30–90 seconds
Failover complete: Under 2–3 minutes

If your RTO is stricter, tune probes more aggressively and mitigate false positives with solid observability and retry logic at the client layer.
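
Client-side retry logic can be as small as the sketch below: exponential backoff with jitter around any callable. The function name and defaults are illustrative, and in practice you would only retry idempotent requests and narrow the caught exception types.

```python
import random
import time


def call_with_retries(fn, attempts=4, base_delay_s=0.5, max_delay_s=8.0, sleep=time.sleep):
    """Call fn(); on an exception, retry with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Jitter spreads retries out so clients don't stampede the new region.
            sleep(delay * random.uniform(0.5, 1.0))
```

A handful of retries spread over a few seconds is often enough to ride through the window in which Front Door is switching origins.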

5. How do I handle stateful sessions with multi-region Front Door?

Options:

Go stateless at the app layer (recommended where possible).
Use distributed caches (e.g., Redis) or centralized session stores that replicate between regions.
For active-passive, consider shorter session lifetimes + re-auth on failover.
Be careful with "sticky sessions" and ensure they don't lock users to a dead region.

6. How do I bring this pattern into a legacy environment?

Start by putting Front Door in front of your existing primary region.
Add a secondary region with a subset of services.
Use DR drills in lower environments first to refine runbooks.
Gradually move more legacy components behind consistent Front Door routing.

You don't have to go all-in on day one; even a partial DR capability is better than none.

7. How do I measure DR success?

Track:

RTO achieved vs target during drills.
RPO (data loss or replay needs).
User impact during failover (error rates, latency).
Time for engineers to execute runbooks.
Number of incidents where DR actually saved you.

Turn those into SLOs that leadership can understand.

8. How does this compare to AWS and GCP?

Rough mapping:

AWS: CloudFront + ALB/NLB + Route 53 health checks and routing policies.
GCP: External HTTP(S) Load Balancer + Cloud CDN + Cloud Armor.

Concepts are similar: health checks, multi-region backends, DR drills. The main differences are in configuration models, naming, and surrounding ecosystem.

Conclusion

In our DR drill, Azure Front Door didn't "fail over" because:

Our health probes were lying to it.
Our expectations didn't match our configuration.
Our DR practice was theoretical rather than muscle memory.

The good news: once you understand how Front Door evaluates backend health and how to align probes with real-world failure modes, it becomes a powerful tool for multi-region resilience.

If you take one thing from this story, let it be this:

Don't wait for a real outage to find out whether your DR works.

Start with a lower environment, codify Front Door and DR behavior in Terraform/Bicep, set up observability, and schedule regular game days. Every drill you run now is one less panic later.

If this resonated with you, follow along, drop your own DR stories in the comments, and share this with the person in your org who will be on call when Azure Front Door is your first line of defense.

Why Personal Branding Matters for Tech Professionals

Siva Sankari — Thu, 04 Dec 2025 11:35:03 +0000

Table of Contents

1. Introduction
2. Why Personal Branding Matters in Tech
3. How Personal Branding Helps Developers Specifically
4. Step-by-Step: How to Build a Technical Personal Brand
4.1 Define Your Technical Niche
4.2 Create Developer-Focused Public Artifacts
4.3 Showcase Your Code (with Example)
4.4 Share Real-World Use Cases & Learnings
4.5 Contribute to Open Source Strategically
4.6 Automate Content Publishing Using Dev Tools
5. Example: A Simple Portfolio API You Can Add to Your Brand
6. Personal Branding Tools for Developers
7. Developer Tips for Growing Your Tech Brand
8. Common Developer Questions
9. Conclusion

🧩Connect with me for career guidance, personalized mentoring, and real-world hands-on project experience www.linkedin.com/in/learnwithsankari

Introduction

Most developers think “my skills speak for themselves.” They don’t, especially in an industry moving as fast as tech is in 2025.

Personal branding is not about becoming an influencer.
It’s about being discoverable, trusted, and visible in the tech ecosystem.

In this practical guide, we’ll explore why personal branding matters for developers, along with tactical steps, complete with code examples, that you can apply starting today.

Why Personal Branding Matters in Tech

Tech careers depend on:

credibility
proof of work
community reputation
discoverability by recruiters, founders, and peers

Because:

A large share of tech hiring happens through referrals and community visibility
Strong GitHub/Dev.to activity often outweighs typical CVs
Engineers with strong brands get higher salaries and better opportunities
Freelance and consulting opportunities depend almost entirely on online presence

In short:
Your personal brand is your career’s API surface. Make it clean, clear, and callable.

How Personal Branding Helps Developers Specifically

1. Faster Opportunities

Your open-source repos, Dev.to articles, and GitHub activity do more than your résumé ever will.

2. Credibility Beyond Titles

“Senior Developer” means nothing without visible proof.
A single well-written article or repo can showcase depth better than a 5-page CV.

3. Networking Without Networking

Strong personal branding = inbound opportunities.
People reach out to you.

4. It Future-Proofs Your Career

Tech stacks change, but three things remain timeless:

problem-solving
technical thinking
reputation

Step-by-Step: How to Build a Technical Personal Brand

Define Your Technical Niche

Avoid broad labels like “Full Stack Developer.”
Instead, go specific:

“SRE specializing in Kubernetes cost optimization”
“Frontend dev focusing on high-performance React apps”
“DevOps engineer building secure CI/CD pipelines”

Developer Tip:
Your niche is not permanent; it evolves as your skills evolve.

Create Developer-Focused Public Artifacts

Public artifacts include:

GitHub repos
Dev.to tutorials
architecture diagrams
demo videos
Dockerfiles
API design documents

If you create something at work, recreate a sanitized example and publish it.

Showcase Your Code (with Example)

Your personal brand should include code samples that demonstrate clarity, structure, and thought process.

Example — A clean Python script that fetches GitHub repo stats for your portfolio:

import requests


def fetch_repo_stats(username):
    """Return name, stars, and forks for a user's public GitHub repositories."""
    url = f"https://api.github.com/users/{username}/repos"
    # per_page=100 is the API maximum; users with more repos would need pagination.
    response = requests.get(url, params={"per_page": 100}, timeout=10)
    # Raise requests.HTTPError on 4xx/5xx (e.g. unknown user, rate limit).
    response.raise_for_status()

    return [
        {
            "name": repo["name"],
            "stars": repo["stargazers_count"],
            "forks": repo["forks_count"],
        }
        for repo in response.json()
    ]


if __name__ == "__main__":
    for repo in fetch_repo_stats("your-username"):
        print(repo)

You can share this as:

a GitHub repo
a Dev.to tutorial
a small portfolio widget

Developers want to see cleanliness, not complexity.

Share Real-World Use Cases & Learnings

Instead of posting “I learned Docker,” post:

“Here’s how I cut container build time from 90s to 22s using multi-stage builds.”

This sets you apart from most developers online, because it shows a measurable result instead of a claim.

Contribute to Open Source Strategically

Start with:

improving README
fixing documentation
adding unit tests
small bug fixes

Visibility comes from consistent contributions, not massive ones.

Automate Content Publishing Using Dev Tools

Your personal branding workflow can run like CI/CD.

Example: Automate Dev.to publishing using their API

curl -X POST \
  -H "api-key: $DEV_TO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "article": {
          "title": "Automating Dev.to Publishing",
          "published": true,
          "body_markdown": "# Hello Dev Community 🚀"
        }
      }' \
  https://dev.to/api/articles

Automation = consistency, consistency = visibility.

Example: A Simple Portfolio API You Can Add to Your Brand

A personal brand becomes powerful when developers can consume it as an API.

Example: Node.js Express API for serving your profile and projects.

// Requires ES modules ("type": "module" in package.json) and the express package.
import express from "express";
const app = express();

// Basic profile summary
app.get("/profile", (req, res) => {
  res.json({
    name: "Your Name",
    role: "DevOps Engineer",
    skills: ["Docker", "Kubernetes", "Terraform", "Python"]
  });
});

// Highlighted projects, returned as a JSON array
app.get("/projects", (req, res) => {
  res.json([
    {
      name: "K8s Autoscaler",
      description: "Dynamic autoscaling via custom metrics",
      tech: ["Go", "Prometheus"]
    },
    {
      name: "Terraform AWS Bootstrap",
      description: "Reusable IaC module for VPC + IAM",
      tech: ["Terraform", "AWS"]
    }
  ]);
});

// Let the hosting platform set the port; default to 3000 locally.
const port = process.env.PORT || 3000;
app.listen(port, () => console.log(`Portfolio API running on port ${port}`));

This can be deployed on:

Vercel
AWS Lambda
Fly.io
Render

Add it to your resume or LinkedIn. Recruiters love interactive portfolios.

Personal Branding Tools for Developers

Content Creation

Hashnode
Dev.to
Medium
GitHub Pages
Notion

Code & Portfolio Hosting

GitHub
GitLab
Vercel
Netlify

Automation Tools

GitHub Actions
Zapier / n8n
Dev.to API

Visual Tools

Excalidraw (architecture diagrams)
Mermaid.js
Draw.io

Developer Tips for Growing Your Tech Brand

Post once a week, even small learnings.
Avoid generic content (“Top 10 tips…”). Use real-world examples.
Show your failures; they teach more than successes.
Document your debugging process; other devs LOVE this.
Write content like you write code:
- concise
- clear
- modular

Common Developer Questions

Q1: Does personal branding really matter for backend/infra engineers?
Yes. Infra roles especially rely on trust. Your published scripts, IaC templates, and case studies build credibility.

Q2: Do I need to become an influencer?
Not at all. You need to be discoverable, not famous. Even 500 strong followers can change your career.

Q3: I’m introverted. Can I still build a brand?
Yes—write instead of speaking.
Introverts often produce the deepest technical content.

Q4: What if my skills aren’t expert-level yet?
Share your learning journey, not expertise.
Beginners relate more to beginners.

Conclusion

Personal branding is a force multiplier for tech professionals. It improves visibility, accelerates opportunities, attracts recruiters, and builds trust in your skills, all while making you a better engineer through consistent sharing.

Start small. Publish one thing this week.
Your future self will thank you.


Agentic AI for Developers: Building Autonomous AI Systems Instead of Chatbots

Khamar fathima — Mon, 01 Dec 2025 18:31:39 +0000

For years, developers have used AI as a tool: an API that generates text, code, or images when prompted. But the next stage of AI isn’t about better prompting. It’s about AI that can think, plan, act, and execute tasks autonomously.

This shift is called Agentic AI, and it’s about to reshape how software gets built.

🔥 Not Just Generating — Completing Tasks

Traditional Gen-AI:
• Input prompt → output text/code/image

Agentic AI:
• Understands a goal
• Breaks it into subtasks
• Triggers tools / APIs / code
• Executes steps
• Evaluates results
• Iterates until success

It’s not a chatbot. It’s an AI worker.

🧠 Core Architecture of an AI Agent

AI Agents usually revolve around these components:
1. Memory
Stores previous actions, user state, results, and context to improve future decisions.
2. Reasoning / Planning
Creates an execution plan instead of responding instantly.
3. Action Module
Uses tools, APIs, browsers, code execution, databases, cloud CLI, etc.
4. Reflection Loop
Analyzes failures and continues until the goal is achieved.

If traditional AI is a function call, Agentic AI is a running program with loops, feedback, and autonomy.
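To make two of these components concrete, here is a small Python sketch of a Memory and an Action Module wired together. The class and method names are illustrative and do not come from any of the frameworks mentioned in this article. The "tools" are plain Python callables standing in for real APIs or CLIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Keeps a running log of what the agent did and what came back."""
    events: List[str] = field(default_factory=list)

    def remember(self, event: str) -> None:
        self.events.append(event)


@dataclass
class ActionModule:
    """Maps tool names to callables and records every call in memory."""
    memory: Memory
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, name: str, tool: Callable[[str], str]) -> None:
        self.tools[name] = tool

    def run(self, name: str, argument: str) -> str:
        tool = self.tools.get(name)
        if tool is None:
            # Report the failure instead of raising, so a reflection
            # step can read it from memory and choose another tool.
            outcome = f"error: unknown tool {name!r}"
        else:
            outcome = tool(argument)
        self.memory.remember(f"{name}({argument!r}) -> {outcome}")
        return outcome
```

Registering str.upper as a tool named "upper" and calling run("upper", "hi") returns "HI" and leaves one entry in memory. The planning and reflection parts would sit on top of this and decide which tool to call next.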

🛠️ Tools and Frameworks Developers Can Start Using Today

If you’re a developer, the easiest way to build AI agents today is through:
• LangChain
• AutoGen
• OpenAI Assistants API
• CrewAI
• LlamaIndex (for memory + context management)

And if you want a simple demonstration, even this concept works:
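while not task_complete:
    plan = ai.generate_plan()          # decide the next step
    action = execute(plan)             # run a tool, API call, or code
    feedback = evaluate(action)        # check the result against the goal
    ai.update_memory(feedback)         # remember what happened
    task_complete = feedback.goal_met  # stop once the goal is reached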

That loop is the essence of Agentic intelligence — plan → act → evaluate → improve → repeat.
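Here is a runnable toy version of that loop in Python, with no LLM involved. The "goal" is to find a hidden number, and all of the function names are made up for illustration. It still shows each stage: the plan comes from memory, the action produces a result, the evaluation yields feedback, and the feedback narrows what the next plan can be.

```python
from typing import Dict, List, Tuple


def plan(memory: Dict) -> int:
    """Plan: pick the midpoint of the range that memory still allows."""
    return (memory["low"] + memory["high"]) // 2


def act(guess: int) -> int:
    """Act: a real agent would call a tool here; this stub returns the guess."""
    return guess


def evaluate(result: int, target: int) -> str:
    """Evaluate: compare the action's result with the goal."""
    if result == target:
        return "done"
    return "too_low" if result < target else "too_high"


def update_memory(memory: Dict, guess: int, feedback: str) -> None:
    """Improve: record the attempt and narrow the search range."""
    memory["history"].append((guess, feedback))
    if feedback == "too_low":
        memory["low"] = guess + 1
    elif feedback == "too_high":
        memory["high"] = guess - 1


def run_agent(target: int, low: int = 0, high: int = 100,
              max_steps: int = 20) -> List[Tuple[int, str]]:
    memory = {"low": low, "high": high, "history": []}
    for _ in range(max_steps):  # hard cap so the loop cannot run forever
        guess = plan(memory)
        result = act(guess)
        feedback = evaluate(result, target)
        update_memory(memory, guess, feedback)
        if feedback == "done":
            break
    return memory["history"]
```

The max_steps cap is a deliberate choice: an agent loop with no budget can spin forever if the goal is unreachable, so a step limit is usually worth adding before anything else.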

💻 Example Use Cases Developers Can Build

These ideas are realistic and already being built by devs today:

🔹 Code Agent

Give it a repository and a feature request. It:
• Reads the codebase
• Generates the required files
• Applies modifications
• Runs tests
• Fixes errors until passing

🔹 Product Research Agent

Input: “Find the top 20 HR SaaS startups that raised funding last year.”
It:
• Scrapes sites automatically
• Aggregates results
• Compresses data
• Creates a final report

🔹 Deployment Agent

Agent that:
• Detects outdated dependencies
• Updates them safely
• Runs CI/CD
• Rolls back on failure

This is not prompting; this is fully automated DevOps.
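The "update safely, roll back on failure" part of the Deployment Agent can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: dependencies are a plain dict of name to version, and checks_pass is a hypothetical callback standing in for a real CI run. The package names and versions are invented.

```python
from typing import Callable, Dict, Tuple


def update_with_rollback(
    deps: Dict[str, str],
    updates: Dict[str, str],
    checks_pass: Callable[[Dict[str, str]], bool],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Apply updates one at a time; revert any update whose checks fail."""
    current = dict(deps)  # work on a copy so the input stays untouched
    report: Dict[str, str] = {}
    for name, new_version in updates.items():
        previous = current.get(name)
        current[name] = new_version
        if checks_pass(current):
            report[name] = "updated"
            continue
        # Checks failed: restore the previous state for this dependency.
        if previous is None:
            del current[name]
        else:
            current[name] = previous
        report[name] = "rolled_back"
    return current, report
```

For example, with deps {"lib-a": "1.0", "lib-b": "2.0"}, updates {"lib-a": "1.1", "lib-b": "3.0"}, and a check that only passes while lib-b stays on a 2.x version, the result keeps lib-a at 1.1 and rolls lib-b back to 2.0. Updating one dependency at a time makes it clear which change broke the build.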

🧩 Why Developers Should Pay Attention

Agentic AI will not replace developers.
It will replace how developers work.

Right now:
• Devs write code → tools help

Future:
• Devs set goals → AI completes tasks → devs review and refine

Developer skill will shift from manual code writing to:
• Architecture
• Strategy
• Debugging
• Reviewing agent output
• Integrating AI into systems

Those who learn this early will have a massive advantage.

⚠️ Realistic Limitations Today

Agentic AI is powerful but imperfect.

Developers should expect:
• Tool errors
• Missing context
• Unclear reasoning
• Sandbox restrictions
• Unexpected side effects

That’s why humans remain essential: autonomous does not mean unsupervised.

⭐ Final Message to Developers

Don’t wait for tutorials. Start building your own agent, even a tiny one.

If you learn:
• prompt engineering
• planning + memory logic
• tool invocation
• evaluation feedback loops

You’re not just learning AI;
you’re learning the next generation of software development.

Agentic AI isn’t here to take away developer jobs.
It’s here to take away the boring parts of development.

The devs who embrace this will build the future.
The devs who ignore it will fall behind it.

🚀 Secrets Safe, 3-Tier Deployments Fast: Terraform + Azure Key Vault Complete Hands-On Guide

TechOpsBySonali — Mon, 01 Dec 2025 10:40:33 +0000

Deploying the same 3-tier application again and again — dev, test, prod — shouldn’t feel like déjà vu every time.
But in many cloud teams, it does.
Manual fixes… copy-pasted Terraform… secrets hardcoded inside .tfvars…
One small change in dev, not updated in prod…
Boom! Configuration drift, broken deployments, security risks.

This hands-on guide shows you exactly how to eliminate all of that using:

✅ Terraform Modular Architecture
✅ Azure Key Vault for Secure Secrets Management
✅ Remote State in Azure Storage
✅ GitHub Actions for Fully Automated CI/CD

By the end, you’ll be able to deploy dev, test, and prod environments identically, securely, and on autopilot.

🔥 1. The Problem: Manual Deployments = Drift + Errors + Chaos

Most teams still deploy environments like this:

Copy old Terraform folder
Change a few names
Adjust IPs manually
Forget a network rule
Hardcode passwords “for now” 😅
Fix mistakes after something breaks

Result?

❌ Inconsistent infra across environments
❌ Security breaches due to exposed secrets
❌ Time wasted troubleshooting
❌ Zero auditability
❌ No single source of truth

This hands-on guide solves exactly this.

🚀 2. Why This Use Case Matters

Cloud teams today need consistency + speed + security.
Manually managing infra no longer works.

This use case delivers:

🧱 Reusable Terraform Modules

Resource Group, VNet, Subnet, NSG, VM — once written, reused forever.

🔐 Zero Secret Sprawl

Passwords and sensitive values stored in Azure Key Vault, pulled directly in Terraform.

🚦 Environment-driven Deployment

All differences (dev/test/prod) live in terraform.tfvars.

🤖 GitHub Actions = Fully Automated Deployments

Plan → Validate → Apply → Audit logs — everything automated.

This is production-grade Terraform, not just a tutorial.

🕒 3. When You Need This Use Case

You need this setup when:

✔️ Deploying multiple environments
✔️ Avoiding inconsistent infra
✔️ Securing all secrets centrally
✔️ Enabling fast onboarding
✔️ Needing auditability and governance
✔️ Running builds from CI/CD pipelines
✔️ Scaling infra to multiple regions

This architecture grows as your company grows.

🛠️ 4. Prerequisites

Azure Subscription
Azure CLI
Terraform Installed
Git + GitHub
Key Vault access
Optional: GitHub Actions Service Principal

You’re ready.

🎯 5. Challenge Questions (Interview-Level)

These make great DevOps interview questions too:

How do you avoid copy-paste Terraform for dev/test/prod?
How do you secure plaintext secrets in Terraform?
How do you stop network drift between environments?
How do you enable new developers to deploy infra securely?
How do you prove that all environments are deployed from the same code?
How do you roll back a Terraform deployment?
How do you prevent faulty tfvars from affecting prod?
How do you design a module for both Linux & Windows VMs?
How do you deploy identical infra to two regions?
Why are modules better than plain Terraform scripts?

🧑‍💻 6. Complete Hands-On Implementation

Below is the full real-life end-to-end setup.

STEP 1️⃣ — Authenticate to Azure & Configure Git

az login
git config --global user.name "yourname"
git config --global user.email "yourmail@example.com"

Initialize GitHub repository:

git init
echo "# azure-3-tier-architecture" >> README.md
git add .
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/<yourid>/azure-3-tier-architecture.git
git push -u origin main

STEP 2️⃣ — Create Backend Resources

LOCATION="eastus"
RG_NAME="tfstate-rg"
STORAGE_NAME="mytfstate12345"
CONTAINER_NAME="tfstate"
KV_NAME="mykeyvault12345"

az group create --name $RG_NAME --location $LOCATION
az storage account create --name $STORAGE_NAME --resource-group $RG_NAME --location $LOCATION --sku Standard_LRS
az storage container create --name $CONTAINER_NAME --account-name $STORAGE_NAME

STEP 3️⃣ — Create Key Vault + Secrets

az keyvault create --name $KV_NAME --resource-group $RG_NAME --location $LOCATION

Store secrets

az keyvault secret set --vault-name $KV_NAME --name "vm-username" --value "learning"
az keyvault secret set --vault-name $KV_NAME --name "vm-password" --value "Redhat@12345"

STEP 4️⃣ — Create GitHub Service Principal

az ad sp create-for-rbac --name "github-spn" --role="Contributor" --scopes="/subscriptions/<subid>" --sdk-auth

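Save the JSON output.

Grant Key Vault access (this role assignment only takes effect when the vault uses Azure RBAC authorization rather than access policies):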

az role assignment create \
  --assignee <clientId> \
  --role "Key Vault Secrets User" \
  --scope $(az keyvault show --name $KV_NAME --query id -o tsv)

STEP 5️⃣ — Create Terraform Structure

terraform/
├── backend.tf
├── main.tf
├── variables.tf
├── environments/
│   ├── dev/terraform.tfvars
│   ├── test/terraform.tfvars
│   └── prod/terraform.tfvars
└── modules/
    ├── rg/
    ├── vnet/
    ├── subnet/
    ├── nsg/
    └── vm/

STEP 6️⃣ — Add Root Terraform Files

backend.tf

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "mytfstate12345"
    container_name       = "tfstate"
    key                  = "3tier/dev.tfstate"
  }
}

variables.tf

variable "location" { type = string }
variable "rg_name" { type = string }
variable "vnet_name" { type = string }
variable "vm_name" { type = string }

main.tf

provider "azurerm" {
  features {}
}

module "rg" {
  source   = "./modules/rg"
  name     = var.rg_name
  location = var.location
}

module "vnet" {
  source              = "./modules/vnet"
  name                = var.vnet_name
  location            = var.location
  resource_group_name = module.rg.name
}

module "subnet" {
  source              = "./modules/subnet"
  name                = "${var.vnet_name}-subnet"
  vnet_name           = module.vnet.name
  resource_group_name = module.rg.name
  nsg_id              = module.nsg.id
}

module "nsg" {
  source              = "./modules/nsg"
  name                = "${var.vnet_name}-nsg"
  location            = var.location
  resource_group_name = module.rg.name
}

data "azurerm_key_vault" "kv" {
  name                = "mykeyvault12345"
  resource_group_name = "tfstate-rg"
}

data "azurerm_key_vault_secret" "vm_username" {
  name         = "vm-username"
  key_vault_id = data.azurerm_key_vault.kv.id
}

data "azurerm_key_vault_secret" "vm_password" {
  name         = "vm-password"
  key_vault_id = data.azurerm_key_vault.kv.id
}

module "vm" {
  source              = "./modules/vm"
  name                = var.vm_name
  location            = var.location
  resource_group_name = module.rg.name
  subnet_id           = module.subnet.id
  admin_username      = data.azurerm_key_vault_secret.vm_username.value
  admin_password      = data.azurerm_key_vault_secret.vm_password.value
}

STEP 7️⃣ — Environment Variables

dev

rg_name   = "rg-dev"
vnet_name = "vnet-dev"
vm_name   = "vm-dev"
location  = "eastus"
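The guide does not show the commands that select an environment, and backend.tf pins the state key to 3tier/dev.tfstate, so test and prod would otherwise share dev's state file. One way to handle this is sketched below in Python. It assumes the folder layout from STEP 5 and a per-environment state key of the form 3tier/<env>.tfstate; the helper name is mine. It builds the terraform init, plan, and apply commands for one environment. The -backend-config value passed on the command line should take precedence over the key in backend.tf, but verify that against your Terraform version.

```python
from typing import List

ENVIRONMENTS = ("dev", "test", "prod")  # folder names under environments/


def terraform_commands(env: str) -> List[List[str]]:
    """Return the init/plan/apply commands for a single environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    var_file = f"environments/{env}/terraform.tfvars"
    plan_file = f"{env}.tfplan"
    return [
        # Assumption: one state blob per environment in the same container.
        ["terraform", "init", "-reconfigure",
         f"-backend-config=key=3tier/{env}.tfstate"],
        ["terraform", "plan", f"-var-file={var_file}", f"-out={plan_file}"],
        # Apply exactly the plan that was reviewed, nothing else.
        ["terraform", "apply", plan_file],
    ]


def as_shell(env: str) -> str:
    """Render the commands as text for copy-paste or a CI log."""
    return "\n".join(" ".join(cmd) for cmd in terraform_commands(env))
```

Calling as_shell("prod") prints the three commands for prod. In a pipeline, each inner list can be passed to subprocess.run(cmd, check=True) from inside the terraform/ directory.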

STEP 8️⃣ — Build Modules

(Example: Resource Group)

modules/rg/main.tf

resource "azurerm_resource_group" "rg" {
  name     = var.name
  location = var.location
}

modules/rg/variables.tf

variable "name" { type = string }
variable "location" { type = string }

modules/rg/outputs.tf

output "name" { value = azurerm_resource_group.rg.name }

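Repeat the same pattern for vnet, subnet, nsg, and vm. Each module must also declare the outputs that main.tf references (for example, module.vnet.name, module.nsg.id, and module.subnet.id).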
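Since the remaining modules are left as an exercise, here is a small helper for working out which outputs each one must declare. This Python sketch scans Terraform source text for module.<name>.<output> references and groups them by module. It is a plain regex pass, not a real HCL parser, and the function name is my own.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches references such as module.rg.name or module.nsg.id.
MODULE_REF = re.compile(r"\bmodule\.([A-Za-z_][\w-]*)\.([A-Za-z_][\w-]*)")


def required_outputs(terraform_source: str) -> Dict[str, Set[str]]:
    """Map each module name to the outputs that the source text reads."""
    needed: Dict[str, Set[str]] = defaultdict(set)
    for module_name, output_name in MODULE_REF.findall(terraform_source):
        needed[module_name].add(output_name)
    return dict(needed)


# A few lines copied from the root main.tf shown earlier in this guide.
SAMPLE = """
resource_group_name = module.rg.name
vnet_name           = module.vnet.name
nsg_id              = module.nsg.id
subnet_id           = module.subnet.id
"""

if __name__ == "__main__":
    for module_name, outputs in sorted(required_outputs(SAMPLE).items()):
        print(module_name, sorted(outputs))
```

Running it on the real main.tf (for example via open("main.tf").read()) gives a checklist: every listed output needs a matching output block in that module's outputs.tf, in the same style as modules/rg/outputs.tf.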

🎉 Final Output

You now have:

✔️ Modular Terraform
✔️ Secure secrets with Key Vault
✔️ Remote state
✔️ Reusable environments
✔️ Ready for GitHub Actions automation

This is true enterprise-grade Infrastructure as Code.

⭐ Follow Me for Daily DevOps & Cloud Content

🔵 LinkedIn: @techopsbysonali
🐦 Twitter / X: @techopsbysonali
📸 Instagram: @techopsbysonali
📝 Medium: @techopsbysonali
📚 Dev.to: @techopsbysonali
🌐 Hashnode: techopsbysonali.hashnode.dev
🖋️ Blogger: techopsbysonali.blogspot.com

📲 Join My WhatsApp Communities
👉 Personalized Guidance: https://wa.me/7620774352
👉 Latest Updates Group: https://lnkd.in/gVTvmRBa
👉 Pune Local Meetup Group: https://lnkd.in/gQbKaUeX