DEV Community

V.Ray

7 Azure Security Gaps I Have Seen in Production (and How to Fix Them)

Note: The issues described here are from real production environments, security audits, and incident reviews — not lab or demo setups.

Over the past few years, while managing 100+ Azure virtual machines in production environments running critical PeopleSoft and Oracle workloads, I have repeatedly encountered the same security gaps across different organizations.

These are not theoretical vulnerabilities from security blogs. They are real-world issues that surface during security audits, incident response investigations, and routine infrastructure reviews.

In this article, I will walk through seven critical Azure security gaps I have personally identified in production, along with:

  • Real-world examples

  • Detection methods using Azure-native tools

  • Practical, step-by-step remediation strategies

Whether you manage a handful of VMs or hundreds, addressing these gaps will significantly improve your Azure security posture before they turn into incidents.

1. Inadequate Network Security Group (NSG) Rules

The Problem

Network Security Groups (NSGs) are the primary network-level control for Azure VMs. Yet I often find production environments with overly permissive inbound rules such as:

  • Allow * from 0.0.0.0/0

  • SSH (22) or RDP (3389) exposed to the public internet

Real-world example:
During a security audit, I found 12 production VMs with SSH open to the internet. NSG flow logs showed 50,000+ failed login attempts in 30 days.

How to Detect

Azure Portal

  • Network Security Groups → Inbound rules

  • Look for:

    • Source: Any or 0.0.0.0/0
    • Ports: 22, 3389, 1433, 5432
  • Review Effective security rules per VM
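
The portal checks above can also be scripted against an exported rule list. The following is a minimal sketch, assuming each rule is a dictionary with the field names Azure uses in its JSON representation of NSG rules (`direction`, `access`, `sourceAddressPrefix`, `destinationPortRange`, and their plural variants). The function names and the set of risky ports are illustrative choices, not part of any Azure SDK.

```python
# Ports from the checklist above; adjust to your environment.
RISKY_PORTS = {22, 3389, 1433, 5432}
# Source values that mean "anyone" in an NSG rule.
OPEN_SOURCES = {"*", "0.0.0.0/0", "internet", "any"}


def port_in_range(port, port_range):
    """Return True if port falls inside a spec such as '22', '20-25' or '*'."""
    if port_range == "*":
        return True
    if "-" in port_range:
        low, high = port_range.split("-", 1)
        return int(low) <= port <= int(high)
    return int(port_range) == port


def risky_inbound_rules(rules):
    """Return names of inbound Allow rules that open a risky port to any source."""
    flagged = []
    for rule in rules:
        if rule.get("direction", "").lower() != "inbound":
            continue
        if rule.get("access", "").lower() != "allow":
            continue
        sources = [rule.get("sourceAddressPrefix") or ""]
        sources += list(rule.get("sourceAddressPrefixes") or [])
        if not any(s.lower() in OPEN_SOURCES for s in sources):
            continue
        ranges = [rule.get("destinationPortRange") or ""]
        ranges += list(rule.get("destinationPortRanges") or [])
        ranges = [r for r in ranges if r]
        if any(port_in_range(p, r) for p in RISKY_PORTS for r in ranges):
            flagged.append(rule["name"])
    return flagged
```

Note that a wide port range such as 3000-4000 is flagged as well, because it silently includes 3389.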

KQL (NSG Flow Logs)

AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog"
// Exclude RFC 1918 private sources (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
| where not(ipv4_is_private(SrcIP_s))
| where DestPort_d in (22, 3389)
| summarize Attempts = count() by SrcIP_s, DestPort_d
| order by Attempts desc
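
If flow records are exported and processed outside Log Analytics, the same public-source filter can be sketched with Python's standard `ipaddress` module. This is a rough illustration, assuming records arrive as `(source_ip, destination_port)` tuples; the function names are hypothetical. It checks the exact RFC 1918 ranges, since only 172.16.0.0/12 is private and addresses such as 172.32.0.1 are public.

```python
import ipaddress
from collections import Counter

# The three RFC 1918 private IPv4 ranges.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_public_source(ip_text):
    """Return True for a valid IP address outside the RFC 1918 ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # blank or malformed values are not counted
    return not any(ip in network for network in RFC1918)


def count_public_attempts(records, ports=(22, 3389)):
    """Count flows per (source IP, port) for public sources hitting the given ports."""
    counts = Counter()
    for src_ip, dest_port in records:
        if dest_port in ports and is_public_source(src_ip):
            counts[(src_ip, dest_port)] += 1
    return counts
```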

The Fix
Immediate

  • Remove 0.0.0.0/0 access for SSH/RDP

  • Allow only trusted office or VPN IP ranges

  • Use Azure Bastion

  • Enable Just-In-Time (JIT) VM access via Microsoft Defender for Cloud

Long-term

  • Use Application Security Groups (ASGs)

  • Implement Azure Firewall

  • Enforce NSG standards using Azure Policy

2. Missing Azure Backup Policies

The Problem

Many teams assume Azure automatically backs up VMs. It doesn’t.
I have seen production workloads with no Recovery Services vault and no backups.

How to Detect

Azure Portal

  • Recovery Services vaults → Backup items

  • Compare against VM inventory

Resources
| where type == "microsoft.compute/virtualmachines"
| project name, resourceGroup
| join kind=leftouter (
RecoveryServicesResources
| where type contains "protectedItems"
| extend vmName = tostring(split(properties.sourceResourceId, "/")[8])
| project vmName
) on $left.name == $right.vmName
| where isnull(vmName)
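
The inventory comparison can also be done offline once both lists are exported. Here is a small sketch, assuming you have one list of VM resource IDs and one list of the source resource IDs of protected backup items. The function name is hypothetical. Comparing full resource IDs case-insensitively avoids false matches between VMs that share a name in different resource groups.

```python
def unprotected_vms(vm_ids, protected_source_ids):
    """Return the VM resource IDs that have no matching protected backup item."""
    # Azure resource IDs are case-insensitive, so normalize before comparing.
    protected = {rid.lower() for rid in protected_source_ids}
    return sorted(vm for vm in vm_ids if vm.lower() not in protected)
```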

The Fix

  • Create Recovery Services vaults per region

  • Define policies aligned with RPO/RTO

  • Enable Soft Delete

  • Test restores quarterly

  • Configure backup alerts via Azure Monitor

3. Weak Authentication Methods

The Problem

Password-based SSH and RDP access is still common. I have seen environments using the same password across multiple admin accounts, so a single leaked credential exposes every one of them.

How to Detect

Linux

grep -i PasswordAuthentication /etc/ssh/sshd_config
grep -i PubkeyAuthentication /etc/ssh/sshd_config
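
A plain grep also matches commented-out lines, so the output needs careful reading. The sketch below shows one way to work out the effective global value, assuming the usual `sshd_config` behaviour that the first uncommented occurrence of a keyword wins and that `PasswordAuthentication` defaults to `yes` when unset. It is deliberately simplified: it stops at the first `Match` block and ignores `Include` files and `Keyword=value` syntax. The function name is hypothetical.

```python
def effective_sshd_option(config_text, keyword, default):
    """Return the first uncommented global value for keyword, or default."""
    wanted = keyword.lower()
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)
        name = parts[0].lower()
        if name == "match":
            break  # the global section ends at the first Match block
        if name == wanted and len(parts) == 2:
            return parts[1].strip().lower()
    return default
```

For example, `effective_sshd_option(open("/etc/ssh/sshd_config").read(), "PasswordAuthentication", "yes")` returns `"no"` only when password logins are actually disabled in the global section.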

Windows

  • Review Azure AD Conditional Access

  • Check Azure AD sign-in logs

The Fix

Linux

  • Disable passwords: PasswordAuthentication no

  • Use SSH key-based authentication

  • Store keys in Azure Key Vault

  • Enable Azure AD Login for Linux VMs

Windows

  • Enforce MFA

  • Reduce local admin usage

  • Use Privileged Access Workstations (PAWs)

4. Unencrypted Data in Transit

The Problem

HTTP endpoints, databases without TLS, and FTP transfers still exist in production — even for sensitive data.

How to Detect

Application Gateway / WAF Logs

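AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
| where Category == "ApplicationGatewayAccessLog"
// requestUri_s contains only the request path, so filter on the TLS flag instead
| where sslEnabled_s == "off"
| summarize Count = count() by requestUri_s, clientIP_s
| order by Count desc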

The Fix

  • Enforce HTTPS end-to-end

  • Redirect HTTP → HTTPS

  • Enable TLS 1.2+ for databases

  • Replace FTP with SFTP/FTPS

  • Manage certificates via Azure Key Vault
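
On the client side, the TLS 1.2+ requirement can be enforced in application code as well as on the server. The following is a minimal sketch using Python's standard `ssl` module; the helper name is hypothetical, and how the context is passed to a database driver depends on that driver.

```python
import ssl


def strict_tls_context():
    """Build a client context that verifies certificates and refuses TLS below 1.2."""
    context = ssl.create_default_context()  # certificate and hostname checks enabled
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The resulting context can be handed to standard-library clients, for example `urllib.request.urlopen(url, context=strict_tls_context())`.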

5. Improper Role-Based Access Control (RBAC)

The Problem

Developers often have Contributor or Owner access at subscription level, violating least privilege.

How to Detect

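Azure Resource Graph

authorizationresources
| where type =~ "microsoft.authorization/roleassignments"
| extend roleDefinitionId = tostring(properties.roleDefinitionId)
// roleDefinitionId ends with the role GUID, not the role name
| where roleDefinitionId endswith "8e3af657-a8ff-443c-a75c-2fe8c4bcb635" // Owner
or roleDefinitionId endswith "b24988ac-6180-42a0-ab88-20f7382dd24c" // Contributor
| project principalId = tostring(properties.principalId), roleDefinitionId, scope = tostring(properties.scope)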
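
To narrow the results to assignments made directly at subscription scope, the exported rows can be filtered further. This is a sketch under stated assumptions: each assignment is a dictionary with `principalId`, `roleDefinitionId`, and `scope` keys, and the two GUIDs are the built-in Owner and Contributor role definition IDs. The function name is hypothetical.

```python
import re

# Built-in role definition GUIDs for the two broad roles.
BROAD_ROLE_IDS = {
    "8e3af657-a8ff-443c-a75c-2fe8c4bcb635": "Owner",
    "b24988ac-6180-42a0-ab88-20f7382dd24c": "Contributor",
}
# Matches a scope that is exactly one subscription, with nothing nested below it.
SUBSCRIPTION_SCOPE = re.compile(r"^/subscriptions/[^/]+/?$", re.IGNORECASE)


def broad_subscription_assignments(assignments):
    """Return (principalId, role name, scope) for Owner/Contributor at subscription scope."""
    findings = []
    for assignment in assignments:
        role_guid = assignment["roleDefinitionId"].rstrip("/").rsplit("/", 1)[-1].lower()
        role_name = BROAD_ROLE_IDS.get(role_guid)
        if role_name and SUBSCRIPTION_SCOPE.match(assignment["scope"]):
            findings.append((assignment["principalId"], role_name, assignment["scope"]))
    return findings
```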

The Fix

  • Audit RBAC regularly

  • Remove unnecessary subscription-level roles

  • Create custom roles

  • Use Azure AD PIM

  • Enable access reviews

6. Missing Activity Log Alerts

The Problem

Critical changes happen with no alerts, leaving teams blind.

How to Detect

  • Azure Monitor → Alerts

  • Filter by Activity Log

The Fix

Create alerts for:

  • NSG rule changes

  • VM deletion/creation

  • RBAC changes

  • Defender for Cloud policy updates

  • Key Vault access changes
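
The alert categories above map onto Activity Log operation names. The sketch below illustrates that mapping for exported events, assuming each event is a dictionary with a flat `operationName` string. The mapping is a starting point rather than a complete list, and the labels and function name are illustrative.

```python
# Resource provider operations worth alerting on, keyed in lowercase.
WATCHED_OPERATIONS = {
    "microsoft.network/networksecuritygroups/securityrules/write": "NSG rule change",
    "microsoft.network/networksecuritygroups/securityrules/delete": "NSG rule change",
    "microsoft.compute/virtualmachines/write": "VM creation or update",
    "microsoft.compute/virtualmachines/delete": "VM deletion",
    "microsoft.authorization/roleassignments/write": "RBAC change",
    "microsoft.authorization/roleassignments/delete": "RBAC change",
    "microsoft.authorization/policyassignments/write": "Policy update",
    "microsoft.keyvault/vaults/write": "Key Vault access change",
}


def classify_event(event):
    """Return the alert category for an Activity Log event, or None if not watched."""
    operation = event.get("operationName", "").lower()
    return WATCHED_OPERATIONS.get(operation)
```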

7. Exposed Management and Database Ports

The Problem

Beyond SSH/RDP, I often find database ports and admin interfaces exposed to the internet.

How to Detect

KQL (NSG Flow Logs)

AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog"
| where DestPort_d in (1433, 3306, 5432, 8080, 8443, 27017, 6379)
// Exclude RFC 1918 private sources (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
| where not(ipv4_is_private(SrcIP_s))
| summarize Count = count() by DestPort_d, DestIP_s
| order by Count desc

The Fix

  • Close non-essential ports

  • Use Private Endpoints / Private Link

  • Deploy Application Gateway with WAF

  • Use Azure Bastion

  • Follow Defender for Cloud recommendations

Conclusion

Azure security is not a one-time setup. It requires continuous monitoring, audits, and proactive remediation.

Most of these gaps exist not because of missing tools, but because of missing guardrails. Azure Policy, Defender for Cloud, and Infrastructure as Code can prevent most of them before they reach production.

Quick Security Checklist (≈ 85 Minutes)

  • Audit NSG rules for 0.0.0.0/0 (5 min)

  • Verify VM backups (10 min)

  • Review RBAC assignments (15 min)

  • Check SSH password authentication (10 min)

  • Enable activity log alerts (20 min)

  • Scan for exposed DB ports (10 min)

  • Identify HTTP traffic (15 min)

Let’s Discuss

Have you seen similar Azure security issues in production?
What guardrails do you use to prevent them?
