👋 Hey there, tech enthusiasts!
I'm Sarvar, a Cloud Architect with a passion for transforming complex technological challenges into elegant solutions. With extensive experience spanning Cloud Operations (AWS & Azure), Data Operations, Analytics, DevOps, and Generative AI, I've had the privilege of architecting solutions for global enterprises that drive real business impact. Through this article series, I'm excited to share practical insights, best practices, and hands-on experiences from my journey in the tech world. Whether you're a seasoned professional or just starting out, I aim to break down complex concepts into digestible pieces that you can apply in your projects.
Let's dive in and explore the fascinating world of cloud technology together! 🚀
The Wake-Up Call
Three months ago, our security team flagged something concerning. Developers were feeding production logs, error messages, and configuration snippets to ChatGPT for debugging help.
The problem? Those logs contained customer identifiers, internal service names, and architectural details we definitely didn't want leaving our network.
We couldn't just block ChatGPT - developers needed AI assistance. The productivity gains were real. But we also couldn't keep hemorrhaging sensitive data to external APIs.
The requirements were clear:
- AI agents need AWS access for legitimate automation tasks
- Zero sensitive data leaves our AWS environment
- Every action must be auditable
- Principle of least privilege, always
- No impact on developer velocity
That's when I started looking at Model Context Protocol (MCP) as a security boundary.
Understanding MCP as a Security Layer
Before diving into implementation, let's clarify what MCP actually does and why it matters for security.
Model Context Protocol is an open standard that sits between your AI agent and your resources. Think of it as a translator and gatekeeper combined.
```
Developer → AI Agent → MCP Server → AWS IAM → AWS Resources
                           ↓
                    Security Layer
```
The MCP server doesn't just pass requests through. It acts as a security boundary that:
- Validates every request before execution
- Translates AI intentions into specific AWS API calls
- Enforces authentication and authorization
- Logs everything for audit trails
- Provides a single point of control
Why this matters: Instead of giving AI agents direct AWS credentials, you give them access to an MCP server that has carefully scoped permissions. The AI never touches AWS credentials. It doesn't even know they exist.
The Security Architecture
After several iterations, here's the pattern that survived production. I'll explain the thinking behind each layer.
Layer 1: Authentication Without Permanent Credentials
The first principle: no permanent credentials anywhere in the system.
Developers authenticate with our existing identity provider (Okta in our case). The identity provider issues a JWT token containing the user's identity and group memberships. The MCP server validates this JWT and issues a short-lived session token - 15 minutes, no exceptions.
Why 15 minutes? Long enough for a debugging session, short enough that a leaked token becomes useless quickly. If someone steals a session token, they have a 15-minute window at most. Compare that to permanent AWS credentials that work forever until manually revoked.
The MCP server never stores these tokens. They're validated, used, and discarded. When they expire, users re-authenticate. It's a minor inconvenience that prevents major security incidents.
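To make the token flow concrete, here's a minimal sketch of issuing and validating a 15-minute session token. It uses a stdlib HMAC signature for brevity; a real deployment would first verify the Okta-issued JWT against the IdP's JWKS endpoint, and the `SECRET` would come from a secrets manager, not code.

```python
import base64
import hashlib
import hmac
import json
import time

SESSION_TTL = 15 * 60   # 15-minute sessions, no exceptions
SECRET = b"rotate-me"   # placeholder; load from a secrets manager in practice

def issue_session_token(user_email, groups):
    """Issue a short-lived, HMAC-signed session token after the IdP JWT checks out."""
    payload = base64.urlsafe_b64encode(json.dumps({
        "sub": user_email,
        "groups": groups,
        "exp": int(time.time()) + SESSION_TTL,
    }).encode())
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def validate_session_token(token):
    """Return the claims if the signature is valid and the token hasn't expired."""
    try:
        payload, sig = token.encode().split(b".")
    except ValueError:
        return None
    expected = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None   # tampered or forged token
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        return None   # expired: user must re-authenticate
    return claims
```

Note the two failure modes are identical from the caller's perspective: an invalid token and an expired token both force re-authentication, which is exactly the behavior you want.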
Layer 2: Request Validation
This is where MCP shines as a security boundary. Every request goes through multiple validation checks:
Action Allowlist: The MCP server maintains a strict list of allowed AWS actions. If the AI requests something not on the list, it's blocked immediately. No wildcards, no "just in case" permissions.
Pattern Detection: I scan every request for dangerous patterns. Words like "delete", "terminate", "destroy" trigger additional scrutiny. Even if the action is technically allowed, suspicious patterns can block the request or require additional approval.
Parameter Sanitization: Before logging or processing, all sensitive parameters get redacted. Passwords, tokens, API keys - anything that looks like a credential gets replaced with [REDACTED] in logs. This prevents credential leakage through audit trails.
Rate Limiting: Each user gets a request budget. Exceed it, and requests start getting throttled. This prevents both accidental runaway scripts and intentional abuse.
The validation happens in milliseconds. Developers don't notice the overhead, but it's the difference between a secure system and a disaster waiting to happen.
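A stripped-down version of the validation pipeline looks like this. The allowlist entries and secret-key names are illustrative, not my production configuration:

```python
import re

# Strict allowlist -- no wildcards, no "just in case" permissions
ALLOWED_ACTIONS = {
    "logs:GetLogEvents", "logs:FilterLogEvents",
    "s3:ListAllMyBuckets", "s3:GetObject",
    "cloudwatch:GetMetricData",
}
# Dangerous patterns that trigger a block even on allowed actions
DANGEROUS = re.compile(r"\b(delete|terminate|destroy)\b", re.IGNORECASE)
# Parameter names that get redacted before logging
SECRET_KEYS = {"password", "token", "secret", "apikey", "api_key"}

def validate_request(action, params, prompt):
    """Return (allowed, reason). Sensitive params are redacted in place."""
    if action not in ALLOWED_ACTIONS:
        return False, f"action not on allowlist: {action}"
    if DANGEROUS.search(prompt):
        return False, "dangerous pattern detected in prompt"
    for key in params:
        if key.lower().replace("-", "_") in SECRET_KEYS:
            params[key] = "[REDACTED]"   # prevent credential leakage via audit logs
    return True, "ok"
```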
Layer 3: AWS Execution with Scoped Permissions
The MCP server uses an IAM role with specific permissions. Not admin. Not power user. Just what's needed for legitimate use cases.
I started by listing every legitimate use case developers had:
- Read CloudWatch logs for debugging
- List S3 buckets to find data
- Get objects from specific buckets
- Query CloudWatch metrics for dashboards
Then I created IAM policies that allow exactly those actions and nothing else.
The key insight: Explicit denies for dangerous actions, even if they're not in the allow list. This protects against future policy changes or misconfigurations.
Example: Even if someone accidentally adds s3:* to the allow list, an explicit deny on s3:DeleteBucket still blocks it. Defense in depth.
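An illustrative policy shape (resources shown as `*` for brevity; in practice `s3:GetObject` is scoped to specific bucket ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedReads",
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents",
        "logs:FilterLogEvents",
        "s3:ListAllMyBuckets",
        "s3:GetObject",
        "cloudwatch:GetMetricData"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyDestructiveActions",
      "Effect": "Deny",
      "Action": [
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "ec2:TerminateInstances",
        "iam:*",
        "kms:*"
      ],
      "Resource": "*"
    }
  ]
}
```

Because IAM evaluates explicit denies before allows, the second statement wins no matter what gets added to the first.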
Layer 4: Comprehensive Audit Trail
CloudTrail logs every AWS API call, but it doesn't capture the context we need. Who made the request? What was the AI prompt? What resources were accessed?
I built a custom logging layer that captures:
- User identity (email, not just IAM role)
- Original AI prompt (hashed, not stored in plain text)
- AWS action requested
- Resources accessed
- Result (success/failure)
- Execution time
All of this goes to CloudWatch Logs in structured JSON format. Now I can query: "Show me all S3 access by user X this week" or "What resources did the AI access when processing this prompt?"
The logs are immutable and retained for 90 days for compliance.
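The log entry itself is just structured JSON. A sketch of the record builder (field names are how I'd shape it, not a standard):

```python
import hashlib
import json
import time

def audit_record(user_email, prompt, action, resources, success, elapsed_ms):
    """Build the structured JSON audit entry sent to CloudWatch Logs."""
    return json.dumps({
        "timestamp": int(time.time() * 1000),
        "user": user_email,                                            # identity, not just the IAM role
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),  # hashed, never plain text
        "action": action,
        "resources": resources,
        "result": "success" if success else "failure",
        "execution_time_ms": elapsed_ms,
    })
```

Hashing the prompt lets you correlate requests (same hash means same prompt) without ever storing potentially sensitive prompt text.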
How We Built It
The deployment came down to three critical security decisions. Each one was driven by a specific threat we wanted to prevent.
Decision 1: Network Isolation Over Convenience
I put the MCP server in a completely separate VPC from production. No shared networks, no VPC peering, nothing. The only communication path is through VPC endpoints to AWS APIs.
Why this matters: If someone compromises the MCP server, they're trapped. No internet access means they can't exfiltrate data. No production VPC access means they can't pivot to other systems. They're stuck in a cage that only opens to specific AWS services.
I chose ECS Fargate because it gave me this isolation without the overhead of managing EC2 instances. No patching, no scaling configuration, just containers in a locked-down network.
The trade-off: More complex networking setup. But the security benefit was worth it. A compromised MCP server becomes useless to an attacker.
Decision 2: Explicit Denies as the Last Line of Defense
The IAM policy has two blocks: allows and denies. The allows are specific - exact actions on exact resources. But the denies are what keep me sleeping at night.
I explicitly deny all delete operations, all terminate operations, all IAM changes, and all KMS key operations. Even if someone misconfigures the allow block and adds s3:*, the deny on s3:DeleteBucket still holds.
Why this matters: Policies get changed. People make mistakes. The deny block is the safety net that catches those mistakes before they become incidents.
The trade-off: More rigid system. If we need to add a delete operation later, we have to modify both blocks. But that friction is intentional - it forces us to think twice about dangerous permissions.
Decision 3: Real-Time Alerting Over Post-Incident Analysis
I set up CloudWatch alarms that fire immediately when something looks wrong. High error rates, unusual request volumes, spikes in blocked actions - all trigger alerts to our security team's Slack channel.
Why this matters: Logs are great for forensics, but alerts prevent incidents. If the AI starts trying malicious actions, I want to know in real-time, not during next week's log review.
The alerts are tuned to avoid noise. More than 50 errors in 5 minutes is abnormal. More than 1,000 requests from one user in 5 minutes is suspicious. These thresholds came from watching normal usage patterns for a month.
The trade-off: Alert fatigue is real. We tune the thresholds monthly based on false positive rates. But I'd rather investigate a false alarm than miss a real attack.
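Those thresholds translate directly into CloudWatch alarm definitions. A sketch, where the metric namespace and SNS topic ARN are placeholders for whatever your MCP server emits and your Slack integration subscribes to:

```python
# Alarm definitions mirroring the thresholds above. Each dict is passed to
# cloudwatch.put_metric_alarm(**params); names and ARNs are illustrative.
ALARMS = [
    {
        "AlarmName": "mcp-high-error-rate",
        "Namespace": "MCP/Server",          # assumed custom metric namespace
        "MetricName": "Errors",
        "Statistic": "Sum",
        "Period": 300,                      # 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 50,                    # >50 errors in 5 minutes is abnormal
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:security-alerts"],
    },
    {
        "AlarmName": "mcp-request-spike",
        "Namespace": "MCP/Server",
        "MetricName": "RequestsPerUser",
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1000,                  # >1,000 requests from one user in 5 minutes
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:security-alerts"],
    },
]

def create_alarms():
    import boto3                            # imported lazily so the definitions stay testable offline
    cloudwatch = boto3.client("cloudwatch")
    for params in ALARMS:
        cloudwatch.put_metric_alarm(**params)
```

Keeping the thresholds in data rather than buried in console clicks is what makes the monthly tuning pass practical.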
What Broke (And How I Fixed It)
Issue 1: Permission Errors Everywhere
What happened: First deployment, every request failed with AccessDenied.
The problem: I was too restrictive. The IAM policy only allowed specific S3 buckets, but developers needed to list buckets first to know what existed.
The fix: Add s3:ListAllMyBuckets with a wildcard resource. Let them see what exists, but control what they can read. It's like letting someone see the library catalog without giving them keys to every book.
Lesson: Start with read-only list permissions, then restrict data access. Users need to discover resources before they can use them.
Issue 2: CloudTrail Logs Were Useless
What happened: CloudTrail showed the MCP server's actions, but not which user requested them.
The problem: All requests came from the same IAM role. No way to trace back to individual users.
The fix: Pass user context through custom CloudWatch Logs. Every MCP request gets logged with the user's email, the action requested, and the resources accessed. Now I can trace every action back to the person who requested it.
Lesson: CloudTrail alone isn't enough for multi-user systems. You need custom logging to capture user context.
Issue 3: AI Agents Tried Creative Exploits
What happened: The AI tried to chain commands to bypass restrictions.
Example request: "First list the S3 buckets, then for each bucket, download all objects and search for passwords."
The problem: My validation checked individual actions, not sequences. The AI was trying to automate a multi-step attack.
The fix: Detect and block chaining attempts. Look for words like "then", "after that", "for each", "loop through". Force users to make explicit, separate requests for each action.
Lesson: AI agents are creative. They'll try to work around restrictions. You need to think like an attacker.
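The chain detector itself is simple. A sketch of the keyword-based check (the marker list reflects the patterns described above; a real deployment would keep tuning it):

```python
import re

# Markers that suggest several actions fused into a single request
CHAIN_MARKERS = re.compile(
    r"\b(then|after that|for each|loop through)\b", re.IGNORECASE
)

def is_chained_request(prompt):
    """True if the prompt looks like a multi-step sequence rather than one action."""
    return bool(CHAIN_MARKERS.search(prompt))
```

It's crude, and it produces some false positives, but forcing explicit separate requests means each step passes through the full validation pipeline on its own.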
Issue 4: Rate Limiting Was Too Aggressive
What happened: Legitimate users hit rate limits during normal debugging sessions.
The problem: I set limits too low (10 requests per minute). Debugging often requires rapid iteration - check logs, adjust query, check again.
The fix: Tiered rate limits based on action type:
- Read operations (Get, Describe): 100 requests per 5 minutes
- List operations: 50 requests per 5 minutes
- Write operations: 10 requests per 5 minutes
Read operations get higher limits because they're lower risk. Write operations stay restricted.
Lesson: One-size-fits-all rate limits don't work. Different actions have different risk profiles.
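A sliding-window implementation of those tiers might look like this (tier classification by API verb prefix is my convention, not an AWS one):

```python
import time
from collections import defaultdict, deque

# (max requests, window in seconds) per tier, mirroring the limits above
LIMITS = {"read": (100, 300), "list": (50, 300), "write": (10, 300)}

def tier_for(action):
    """Classify an AWS action by its verb prefix; anything unknown is treated as a write."""
    verb = action.split(":")[-1]
    if verb.startswith(("Get", "Describe")):
        return "read"
    if verb.startswith("List"):
        return "list"
    return "write"

class TieredRateLimiter:
    def __init__(self):
        self._hits = defaultdict(deque)   # (user, tier) -> recent request timestamps

    def allow(self, user, action, now=None):
        now = time.time() if now is None else now
        tier = tier_for(action)
        limit, window = LIMITS[tier]
        q = self._hits[(user, tier)]
        while q and q[0] <= now - window:  # drop timestamps outside the window
            q.popleft()
        if len(q) >= limit:
            return False                   # over budget: throttle
        q.append(now)
        return True
```

Each tier has its own budget per user, so a burst of reads during a debugging session never eats into the write allowance.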
What I Learned
After three months in production, here's what actually matters:
1. Explicit Denies Are Your Friend
Don't rely on "not allowing" something. Explicitly deny dangerous actions. Even if someone misconfigures the allow rules, the denies hold.
I have explicit denies for:
- All delete operations
- All terminate operations
- All IAM operations
- All KMS key operations
These are the "break glass" protections. They prevent catastrophic mistakes.
2. Log Everything, But Make It Searchable
CloudTrail is great, but you need custom logs for MCP-specific context. I send everything to CloudWatch Logs with structured JSON.
Now I can query: "Show me all S3 access by user X in the last hour" or "What resources did the AI access when processing this prompt?"
The logs are immutable and retained for 90 days. If something goes wrong, I can reconstruct exactly what happened.
3. Sanitize Everything
Never log the actual AI prompts. They might contain sensitive data. I hash them instead.
You can still correlate requests (same hash = same prompt), but you're not storing potentially sensitive prompts in logs.
4. Network Isolation Matters
The MCP server runs in a private VPC with no internet access. It can only reach:
- AWS API endpoints (via VPC endpoints)
- Internal authentication service
- CloudWatch Logs
If someone compromises the MCP server, they can't exfiltrate data. They're stuck in an isolated network.
5. Test Your Security Controls
I wrote tests to verify the security controls actually work. Tests like:
- Verify delete operations are blocked
- Verify IAM operations are blocked
- Verify rate limits work
- Verify audit logs capture user context
Run these tests in CI/CD. If they pass, your security controls are working. If they fail, you know immediately.
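One pattern I find useful: test the IAM policy document itself, not just the server. A sketch, with a trimmed inline copy of the policy (in CI you'd load it from the repo; the path would be whatever your layout uses):

```python
import fnmatch

# Trimmed copy of the policy under test; in CI, load the real JSON from the repo.
POLICY = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "logs:GetLogEvents"]},
        {"Effect": "Deny", "Action": ["s3:Delete*", "ec2:Terminate*", "iam:*", "kms:*"]},
    ]
}

def is_denied(action):
    """True if any Deny statement matches the action, wildcards included."""
    for stmt in POLICY["Statement"]:
        if stmt["Effect"] != "Deny":
            continue
        if any(fnmatch.fnmatch(action, pat) for pat in stmt["Action"]):
            return True
    return False

def test_destructive_actions_blocked():
    for action in ("s3:DeleteBucket", "ec2:TerminateInstances",
                   "iam:CreateUser", "kms:ScheduleKeyDeletion"):
        assert is_denied(action), f"{action} must be explicitly denied"

def test_reads_not_denied():
    assert not is_denied("s3:GetObject")
```

If someone widens the allow block in a future change, these tests still fail the build unless the explicit denies survive.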
Alternative Approaches I Considered
Option 1: Direct IAM Roles for AI Agents
Pros:
- Simpler architecture
- No MCP server to maintain
- Lower latency
Cons:
- No request validation layer
- Can't block dangerous patterns
- Harder to audit user actions
- AI has direct AWS credentials
Why I didn't use it: Too risky. One prompt injection and the AI could delete production resources. The MCP layer provides defense in depth.
Option 2: AWS Lambda as MCP Server
Pros:
- Serverless, no infrastructure
- Automatic scaling
- Pay per request
Cons:
- Cold starts (500ms+)
- 15-minute timeout limit
- Harder to maintain state (rate limiting)
- More complex networking
Why I didn't use it: Cold starts killed the developer experience. Waiting 500ms for every request was frustrating. Fargate has no cold starts.
Option 3: API Gateway + Lambda
Pros:
- Built-in rate limiting
- API key management
- Request/response transformation
Cons:
- More complex setup
- Higher cost at scale
- Still has Lambda cold starts
- Overkill for internal use
Why I didn't use it: The built-in rate limiting was nice, but not worth the complexity for an internal tool. Fargate + ALB was simpler.
Best Practices That Actually Matter
1. Start With Read-Only
Deploy with read-only permissions first. Let developers use it for a week. Then gradually add write permissions based on actual needs.
This prevents over-permissioning. You'll discover what developers actually need, not what they think they need.
2. Use Separate AWS Accounts
Run the MCP server in a separate AWS account from your production workloads. Use cross-account roles for access.
If the MCP account is compromised, production is still isolated. It's an extra layer of defense.
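The cross-account hop is a standard STS assume-role call. A sketch, where the role ARN, account ID, and session duration are illustrative:

```python
# The MCP account assumes a scoped role in the workload account.
ASSUME_ROLE_PARAMS = {
    "RoleArn": "arn:aws:iam::123456789012:role/mcp-readonly",  # workload account (example ID)
    "RoleSessionName": "mcp-session",
    "DurationSeconds": 900,   # 15 minutes, matching the session token TTL
}

def get_cross_account_s3():
    import boto3              # imported lazily; the params stay testable offline
    sts = boto3.client("sts")
    creds = sts.assume_role(**ASSUME_ROLE_PARAMS)["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

The temporary credentials expire on their own, so even this cross-account path honors the no-permanent-credentials principle.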
3. Monitor for Anomalies
Set up CloudWatch alarms for unusual patterns:
- High error rates (>50 errors in 5 minutes)
- Unusual access patterns (>1,000 requests in 5 minutes)
- Blocked actions (>100 blocks in 5 minutes)
These alerts go to your security team. Response time is critical.
4. Regular Security Reviews
Every month, review:
- Which actions are being used most
- Which permissions are never used (remove them)
- Any blocked requests (are they legitimate needs?)
- Rate limit effectiveness
Security isn't set-and-forget. It requires ongoing attention.
5. Document Everything
Create a runbook for common scenarios:
- How to add a new allowed action
- How to investigate suspicious activity
- How to rotate credentials
- How to handle a security incident
When something goes wrong at 2 AM, you'll be glad you documented it.
Summary
Three months in production taught me that securing AI agent access isn't about perfect security - it's about making attacks harder than they're worth while keeping developers productive.
The MCP pattern works because it gives you a single point of control. You're not trying to secure the AI agent itself. You're securing the gateway it uses to access your resources. That gateway validates every request, enforces least privilege, logs everything, and runs in an isolated network.
We went from developers sending production data to ChatGPT to having a secure, auditable system where AI agents help without creating risk. The benefit? No more 2 AM calls about data leaks.
Is it perfect? No. Can a determined attacker find ways around it? Probably. But it's dramatically better than the alternatives: giving AI agents direct AWS credentials or blocking AI tools entirely and watching developers find workarounds.
The key insight: Security is not about building walls. It's about building gates with guards. MCP is that gate.
📌 Wrapping Up
Thank you for reading! I hope this gave you practical ideas for securing AI agent access in your environment.
Found this useful?
- ❤️ Like if it helped you think through your security approach
- 🦄 Unicorn if you're implementing this pattern
- 💾 Save for your next security review
- 🔄 Share with your security team
Follow me for more on:
- AWS security patterns
- AI/ML infrastructure
- Cloud architecture
- DevSecOps practices
💡 What's Next
I'm working on a follow-up article about monitoring and alerting for MCP deployments. Follow for updates.
Also exploring: Multi-region MCP deployments and disaster recovery patterns.
🌐 Portfolio & Work
Explore my full body of work, certifications, and architecture projects:
🛠️ Services I Offer
Looking for hands-on guidance with cloud security or AI infrastructure?
- Cloud Security Architecture (AWS / Azure)
- AI/ML Infrastructure Design
- Security Audit & Remediation
- Technical Writing & Documentation
- Architecture Reviews
- 1:1 Technical Mentorship
🤝 Let's Connect
Questions about implementing this pattern? Drop a comment or connect with me on LinkedIn.
For consulting or technical discussions: simplynadaf@gmail.com
Stay secure! 🔒