👋 Hey there, tech enthusiasts!
I'm Sarvar, a Cloud Architect with a passion for transforming complex technological challenges into elegant solutions. With extensive experience spanning Cloud Operations (AWS & Azure), Data Operations, Analytics, DevOps, and Generative AI, I've had the privilege of architecting solutions for global enterprises that drive real business impact. Through this article series, I'm excited to share practical insights, best practices, and hands-on experiences from my journey in the tech world. Whether you're a seasoned professional or just starting out, I aim to break down complex concepts into digestible pieces that you can apply in your projects.
Let's dive in and explore the fascinating world of cloud technology together! 🚀
The Wake-Up Call
Three months ago, our security team flagged something concerning. Developers were feeding production logs, error messages, and configuration snippets to ChatGPT for debugging help.
The problem? Those logs contained customer identifiers, internal service names, and architectural details we definitely didn't want leaving our network.
We couldn't just block ChatGPT - developers needed AI assistance. The productivity gains were real. But we also couldn't keep hemorrhaging sensitive data to external APIs.
The requirements were clear:
- AI agents need AWS access for legitimate automation tasks
- Zero sensitive data leaves our AWS environment
- Every action must be auditable
- Principle of least privilege, always
- No impact on developer velocity
That's when I started looking at Model Context Protocol (MCP) as a security boundary.
Understanding MCP as a Security Layer
Before diving into implementation, let's clarify what MCP actually does and why it matters for security.
Model Context Protocol is an open standard that sits between your AI agent and your resources. Think of it as a translator and gatekeeper combined.
```
Developer → AI Agent → MCP Server → AWS IAM → AWS Resources
                           ↓
                    Security Layer
```
The MCP server doesn't just pass requests through. It acts as a security boundary that:
- Validates every request before execution
- Translates AI intentions into specific AWS API calls
- Enforces authentication and authorization
- Logs everything for audit trails
- Provides a single point of control
Why this matters: Instead of giving AI agents direct AWS credentials, you give them access to an MCP server that has carefully scoped permissions. The AI never touches AWS credentials. It doesn't even know they exist.
The Security Architecture
After several iterations, here's the pattern that survived production. I'll explain the thinking behind each layer.
Layer 1: Authentication Without Permanent Credentials
The first principle: no permanent credentials anywhere in the system.
Developers authenticate with our existing identity provider (Okta in our case). The identity provider issues a JWT token containing the user's identity and group memberships. The MCP server validates this JWT and issues a short-lived session token - 15 minutes, no exceptions.
Why 15 minutes? Long enough for a debugging session, short enough that a leaked token becomes useless quickly. If someone steals a session token, they have a 15-minute window at most. Compare that to permanent AWS credentials that work forever until manually revoked.
The MCP server never stores these tokens. They're validated, used, and discarded. When they expire, users re-authenticate. It's a minor inconvenience that prevents major security incidents.
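To make the token flow concrete, here's a minimal sketch of issuing and validating a 15-minute session token. It uses a stdlib HMAC signature for brevity; a real deployment would first verify the Okta-issued JWT against the IdP's JWKS endpoint, and the `SECRET` would come from a secrets manager, not code.

```python
import base64
import hashlib
import hmac
import json
import time

SESSION_TTL = 15 * 60   # 15-minute sessions, no exceptions
SECRET = b"rotate-me"   # placeholder; load from a secrets manager in practice

def issue_session_token(user_email, groups):
    """Issue a short-lived, HMAC-signed session token after the IdP JWT checks out."""
    payload = base64.urlsafe_b64encode(json.dumps({
        "sub": user_email,
        "groups": groups,
        "exp": int(time.time()) + SESSION_TTL,
    }).encode())
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def validate_session_token(token):
    """Return the claims if the signature is valid and the token hasn't expired."""
    try:
        payload, sig = token.encode().split(b".")
    except ValueError:
        return None
    expected = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None   # tampered or forged token
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        return None   # expired: user must re-authenticate
    return claims
```

Note the two failure modes are identical from the caller's perspective: an invalid token and an expired token both force re-authentication, which is exactly the behavior you want.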
Layer 2: Request Validation
This is where MCP shines as a security boundary. Every request goes through multiple validation checks:
Action Allowlist: The MCP server maintains a strict list of allowed AWS actions. If the AI requests something not on the list, it's blocked immediately. No wildcards, no "just in case" permissions.
Pattern Detection: I scan every request for dangerous patterns. Words like "delete", "terminate", "destroy" trigger additional scrutiny. Even if the action is technically allowed, suspicious patterns can block the request or require additional approval.
Parameter Sanitization: Before logging or processing, all sensitive parameters get redacted. Passwords, tokens, API keys - anything that looks like a credential gets replaced with [REDACTED] in logs. This prevents credential leakage through audit trails.
Rate Limiting: Each user gets a request budget. Exceed it, and requests start getting throttled. This prevents both accidental runaway scripts and intentional abuse.
The validation happens in milliseconds. Developers don't notice the overhead, but it's the difference between a secure system and a disaster waiting to happen.
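A stripped-down version of the validation pipeline looks like this. The allowlist entries and secret-key names are illustrative, not my production configuration:

```python
import re

# Strict allowlist -- no wildcards, no "just in case" permissions
ALLOWED_ACTIONS = {
    "logs:GetLogEvents", "logs:FilterLogEvents",
    "s3:ListAllMyBuckets", "s3:GetObject",
    "cloudwatch:GetMetricData",
}
# Dangerous patterns that trigger a block even on allowed actions
DANGEROUS = re.compile(r"\b(delete|terminate|destroy)\b", re.IGNORECASE)
# Parameter names that get redacted before logging
SECRET_KEYS = {"password", "token", "secret", "apikey", "api_key"}

def validate_request(action, params, prompt):
    """Return (allowed, reason). Sensitive params are redacted in place."""
    if action not in ALLOWED_ACTIONS:
        return False, f"action not on allowlist: {action}"
    if DANGEROUS.search(prompt):
        return False, "dangerous pattern detected in prompt"
    for key in params:
        if key.lower().replace("-", "_") in SECRET_KEYS:
            params[key] = "[REDACTED]"   # prevent credential leakage via audit logs
    return True, "ok"
```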
Layer 3: AWS Execution with Scoped Permissions
The MCP server uses an IAM role with specific permissions. Not admin. Not power user. Just what's needed for legitimate use cases.
I started by listing every legitimate use case developers had:
- Read CloudWatch logs for debugging
- List S3 buckets to find data
- Get objects from specific buckets
- Query CloudWatch metrics for dashboards
Then I created IAM policies that allow exactly those actions and nothing else.
The key insight: Explicit denies for dangerous actions, even if they're not in the allow list. This protects against future policy changes or misconfigurations.
Example: Even if someone accidentally adds s3:* to the allow list, an explicit deny on s3:DeleteBucket still blocks it. Defense in depth.
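An illustrative policy shape (resources shown as `*` for brevity; in practice `s3:GetObject` is scoped to specific bucket ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedReads",
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents",
        "logs:FilterLogEvents",
        "s3:ListAllMyBuckets",
        "s3:GetObject",
        "cloudwatch:GetMetricData"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyDestructiveActions",
      "Effect": "Deny",
      "Action": [
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "ec2:TerminateInstances",
        "iam:*",
        "kms:*"
      ],
      "Resource": "*"
    }
  ]
}
```

Because IAM evaluates explicit denies before allows, the second statement wins no matter what gets added to the first.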
Layer 4: Comprehensive Audit Trail
CloudTrail logs every AWS API call, but it doesn't capture the context we need. Who made the request? What was the AI prompt? What resources were accessed?
I built a custom logging layer that captures:
- User identity (email, not just IAM role)
- Original AI prompt (hashed, not stored in plain text)
- AWS action requested
- Resources accessed
- Result (success/failure)
- Execution time
All of this goes to CloudWatch Logs in structured JSON format. Now I can query: "Show me all S3 access by user X this week" or "What resources did the AI access when processing this prompt?"
The logs are immutable and retained for 90 days for compliance.
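The log entry itself is just structured JSON. A sketch of the record builder (field names are how I'd shape it, not a standard):

```python
import hashlib
import json
import time

def audit_record(user_email, prompt, action, resources, success, elapsed_ms):
    """Build the structured JSON audit entry sent to CloudWatch Logs."""
    return json.dumps({
        "timestamp": int(time.time() * 1000),
        "user": user_email,                                            # identity, not just the IAM role
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),  # hashed, never plain text
        "action": action,
        "resources": resources,
        "result": "success" if success else "failure",
        "execution_time_ms": elapsed_ms,
    })
```

Hashing the prompt lets you correlate requests (same hash means same prompt) without ever storing potentially sensitive prompt text.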
How We Built It
The deployment came down to three critical security decisions. Each one was driven by a specific threat we wanted to prevent.
Decision 1: Network Isolation Over Convenience
I put the MCP server in a completely separate VPC from production. No shared networks, no VPC peering, nothing. The only communication path is through VPC endpoints to AWS APIs.
Why this matters: If someone compromises the MCP server, they're trapped. No internet access means they can't exfiltrate data. No production VPC access means they can't pivot to other systems. They're stuck in a cage that only opens to specific AWS services.
I chose ECS Fargate because it gave me this isolation without the overhead of managing EC2 instances. No patching, no scaling configuration, just containers in a locked-down network.
The trade-off: More complex networking setup. But the security benefit was worth it. A compromised MCP server becomes useless to an attacker.
Decision 2: Explicit Denies as the Last Line of Defense
The IAM policy has two blocks: allows and denies. The allows are specific - exact actions on exact resources. But the denies are what keep me sleeping at night.
I explicitly deny all delete operations, all terminate operations, all IAM changes, and all KMS key operations. Even if someone misconfigures the allow block and adds s3:*, the deny on s3:DeleteBucket still holds.
Why this matters: Policies get changed. People make mistakes. The deny block is the safety net that catches those mistakes before they become incidents.
The trade-off: More rigid system. If we need to add a delete operation later, we have to modify both blocks. But that friction is intentional - it forces us to think twice about dangerous permissions.
Decision 3: Real-Time Alerting Over Post-Incident Analysis
I set up CloudWatch alarms that fire immediately when something looks wrong. High error rates, unusual request volumes, spikes in blocked actions - all trigger alerts to our security team's Slack channel.
Why this matters: Logs are great for forensics, but alerts prevent incidents. If the AI starts trying malicious actions, I want to know in real-time, not during next week's log review.
The alerts are tuned to avoid noise. More than 50 errors in 5 minutes is abnormal. More than 1,000 requests from one user in 5 minutes is suspicious. These thresholds came from watching normal usage patterns for a month.
The trade-off: Alert fatigue is real. We tune the thresholds monthly based on false positive rates. But I'd rather investigate a false alarm than miss a real attack.
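Those thresholds translate directly into CloudWatch alarm definitions. A sketch, where the metric namespace and SNS topic ARN are placeholders for whatever your MCP server emits and your Slack integration subscribes to:

```python
# Alarm definitions mirroring the thresholds above. Each dict is passed to
# cloudwatch.put_metric_alarm(**params); names and ARNs are illustrative.
ALARMS = [
    {
        "AlarmName": "mcp-high-error-rate",
        "Namespace": "MCP/Server",          # assumed custom metric namespace
        "MetricName": "Errors",
        "Statistic": "Sum",
        "Period": 300,                      # 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 50,                    # >50 errors in 5 minutes is abnormal
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:security-alerts"],
    },
    {
        "AlarmName": "mcp-request-spike",
        "Namespace": "MCP/Server",
        "MetricName": "RequestsPerUser",
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1000,                  # >1,000 requests from one user in 5 minutes
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:security-alerts"],
    },
]

def create_alarms():
    import boto3                            # imported lazily so the definitions stay testable offline
    cloudwatch = boto3.client("cloudwatch")
    for params in ALARMS:
        cloudwatch.put_metric_alarm(**params)
```

Keeping the thresholds in data rather than buried in console clicks is what makes the monthly tuning pass practical.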
What Broke (And How I Fixed It)
Issue 1: Permission Errors Everywhere
What happened: First deployment, every request failed with AccessDenied.
The problem: I was too restrictive. The IAM policy only allowed specific S3 buckets, but developers needed to list buckets first to know what existed.
The fix: Add s3:ListAllMyBuckets with a wildcard resource. Let them see what exists, but control what they can read. It's like letting someone see the library catalog without giving them keys to every book.
Lesson: Start with read-only list permissions, then restrict data access. Users need to discover resources before they can use them.
Issue 2: CloudTrail Logs Were Useless
What happened: CloudTrail showed the MCP server's actions, but not which user requested them.
The problem: All requests came from the same IAM role. No way to trace back to individual users.
The fix: Pass user context through custom CloudWatch Logs. Every MCP request gets logged with the user's email, the action requested, and the resources accessed. Now I can trace every action back to the person who requested it.
Lesson: CloudTrail alone isn't enough for multi-user systems. You need custom logging to capture user context.
Issue 3: AI Agents Tried Creative Exploits
What happened: The AI tried to chain commands to bypass restrictions.
Example request: "First list the S3 buckets, then for each bucket, download all objects and search for passwords."
The problem: My validation checked individual actions, not sequences. The AI was trying to automate a multi-step attack.
The fix: Detect and block chaining attempts. Look for words like "then", "after that", "for each", "loop through". Force users to make explicit, separate requests for each action.
Lesson: AI agents are creative. They'll try to work around restrictions. You need to think like an attacker.
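The chain detector itself is simple. A sketch of the keyword-based check (the marker list reflects the patterns described above; a real deployment would keep tuning it):

```python
import re

# Markers that suggest several actions fused into a single request
CHAIN_MARKERS = re.compile(
    r"\b(then|after that|for each|loop through)\b", re.IGNORECASE
)

def is_chained_request(prompt):
    """True if the prompt looks like a multi-step sequence rather than one action."""
    return bool(CHAIN_MARKERS.search(prompt))
```

It's crude, and it produces some false positives, but forcing explicit separate requests means each step passes through the full validation pipeline on its own.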
Issue 4: Rate Limiting Was Too Aggressive
What happened: Legitimate users hit rate limits during normal debugging sessions.
The problem: I set limits too low (10 requests per minute). Debugging often requires rapid iteration - check logs, adjust query, check again.
The fix: Tiered rate limits based on action type:
- Read operations (Get, Describe): 100 requests per 5 minutes
- List operations: 50 requests per 5 minutes
- Write operations: 10 requests per 5 minutes
Read operations get higher limits because they're lower risk. Write operations stay restricted.
Lesson: One-size-fits-all rate limits don't work. Different actions have different risk profiles.
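A sliding-window implementation of those tiers might look like this (tier classification by API verb prefix is my convention, not an AWS one):

```python
import time
from collections import defaultdict, deque

# (max requests, window in seconds) per tier, mirroring the limits above
LIMITS = {"read": (100, 300), "list": (50, 300), "write": (10, 300)}

def tier_for(action):
    """Classify an AWS action by its verb prefix; anything unknown is treated as a write."""
    verb = action.split(":")[-1]
    if verb.startswith(("Get", "Describe")):
        return "read"
    if verb.startswith("List"):
        return "list"
    return "write"

class TieredRateLimiter:
    def __init__(self):
        self._hits = defaultdict(deque)   # (user, tier) -> recent request timestamps

    def allow(self, user, action, now=None):
        now = time.time() if now is None else now
        tier = tier_for(action)
        limit, window = LIMITS[tier]
        q = self._hits[(user, tier)]
        while q and q[0] <= now - window:  # drop timestamps outside the window
            q.popleft()
        if len(q) >= limit:
            return False                   # over budget: throttle
        q.append(now)
        return True
```

Each tier has its own budget per user, so a burst of reads during a debugging session never eats into the write allowance.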
What I Learned
After three months in production, here's what actually matters:
1. Explicit Denies Are Your Friend
Don't rely on "not allowing" something. Explicitly deny dangerous actions. Even if someone misconfigures the allow rules, the denies hold.
I have explicit denies for:
- All delete operations
- All terminate operations
- All IAM operations
- All KMS key operations
These are the "break glass" protections. They prevent catastrophic mistakes.
2. Log Everything, But Make It Searchable
CloudTrail is great, but you need custom logs for MCP-specific context. I send everything to CloudWatch Logs with structured JSON.
Now I can query: "Show me all S3 access by user X in the last hour" or "What resources did the AI access when processing this prompt?"
The logs are immutable and retained for 90 days. If something goes wrong, I can reconstruct exactly what happened.
3. Sanitize Everything
Never log the actual AI prompts. They might contain sensitive data. I hash them instead.
You can still correlate requests (same hash = same prompt), but you're not storing potentially sensitive prompts in logs.
4. Network Isolation Matters
The MCP server runs in a private VPC with no internet access. It can only reach:
- AWS API endpoints (via VPC endpoints)
- Internal authentication service
- CloudWatch Logs
If someone compromises the MCP server, they can't exfiltrate data. They're stuck in an isolated network.
5. Test Your Security Controls
I wrote tests to verify the security controls actually work. Tests like:
- Verify delete operations are blocked
- Verify IAM operations are blocked
- Verify rate limits work
- Verify audit logs capture user context
Run these tests in CI/CD. If they pass, your security controls are working. If they fail, you know immediately.
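One pattern I find useful: test the IAM policy document itself, not just the server. A sketch, with a trimmed inline copy of the policy (in CI you'd load it from the repo; the path would be whatever your layout uses):

```python
import fnmatch

# Trimmed copy of the policy under test; in CI, load the real JSON from the repo.
POLICY = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "logs:GetLogEvents"]},
        {"Effect": "Deny", "Action": ["s3:Delete*", "ec2:Terminate*", "iam:*", "kms:*"]},
    ]
}

def is_denied(action):
    """True if any Deny statement matches the action, wildcards included."""
    for stmt in POLICY["Statement"]:
        if stmt["Effect"] != "Deny":
            continue
        if any(fnmatch.fnmatch(action, pat) for pat in stmt["Action"]):
            return True
    return False

def test_destructive_actions_blocked():
    for action in ("s3:DeleteBucket", "ec2:TerminateInstances",
                   "iam:CreateUser", "kms:ScheduleKeyDeletion"):
        assert is_denied(action), f"{action} must be explicitly denied"

def test_reads_not_denied():
    assert not is_denied("s3:GetObject")
```

If someone widens the allow block in a future change, these tests still fail the build unless the explicit denies survive.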
Alternative Approaches I Considered
Option 1: Direct IAM Roles for AI Agents
Pros:
- Simpler architecture
- No MCP server to maintain
- Lower latency
Cons:
- No request validation layer
- Can't block dangerous patterns
- Harder to audit user actions
- AI has direct AWS credentials
Why I didn't use it: Too risky. One prompt injection and the AI could delete production resources. The MCP layer provides defense in depth.
Option 2: AWS Lambda as MCP Server
Pros:
- Serverless, no infrastructure
- Automatic scaling
- Pay per request
Cons:
- Cold starts (500ms+)
- 15-minute timeout limit
- Harder to maintain state (rate limiting)
- More complex networking
Why I didn't use it: Cold starts killed the developer experience. Waiting 500ms for every request was frustrating. Fargate has no cold starts.
Option 3: API Gateway + Lambda
Pros:
- Built-in rate limiting
- API key management
- Request/response transformation
Cons:
- More complex setup
- Higher cost at scale
- Still has Lambda cold starts
- Overkill for internal use
Why I didn't use it: The built-in rate limiting was nice, but not worth the complexity for an internal tool. Fargate + ALB was simpler.
Best Practices That Actually Matter
1. Start With Read-Only
Deploy with read-only permissions first. Let developers use it for a week. Then gradually add write permissions based on actual needs.
This prevents over-permissioning. You'll discover what developers actually need, not what they think they need.
2. Use Separate AWS Accounts
Run the MCP server in a separate AWS account from your production workloads. Use cross-account roles for access.
If the MCP account is compromised, production is still isolated. It's an extra layer of defense.
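The cross-account hop is a standard STS assume-role call. A sketch, where the role ARN, account ID, and session duration are illustrative:

```python
# The MCP account assumes a scoped role in the workload account.
ASSUME_ROLE_PARAMS = {
    "RoleArn": "arn:aws:iam::123456789012:role/mcp-readonly",  # workload account (example ID)
    "RoleSessionName": "mcp-session",
    "DurationSeconds": 900,   # 15 minutes, matching the session token TTL
}

def get_cross_account_s3():
    import boto3              # imported lazily; the params stay testable offline
    sts = boto3.client("sts")
    creds = sts.assume_role(**ASSUME_ROLE_PARAMS)["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

The temporary credentials expire on their own, so even this cross-account path honors the no-permanent-credentials principle.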
3. Monitor for Anomalies
Set up CloudWatch alarms for unusual patterns:
- High error rates (>50 errors in 5 minutes)
- Unusual access patterns (>1,000 requests in 5 minutes)
- Blocked actions (>100 blocks in 5 minutes)
These alerts go to your security team. Response time is critical.
4. Regular Security Reviews
Every month, review:
- Which actions are being used most
- Which permissions are never used (remove them)
- Any blocked requests (are they legitimate needs?)
- Rate limit effectiveness
Security isn't set-and-forget. It requires ongoing attention.
5. Document Everything
Create a runbook for common scenarios:
- How to add a new allowed action
- How to investigate suspicious activity
- How to rotate credentials
- How to handle a security incident
When something goes wrong at 2 AM, you'll be glad you documented it.
Summary
Three months in production taught me that securing AI agent access isn't about perfect security - it's about making attacks harder than they're worth while keeping developers productive.
The MCP pattern works because it gives you a single point of control. You're not trying to secure the AI agent itself. You're securing the gateway it uses to access your resources. That gateway validates every request, enforces least privilege, logs everything, and runs in an isolated network.
We went from developers sending production data to ChatGPT to having a secure, auditable system where AI agents help without creating risk. The benefit? No more 2 AM calls about data leaks.
Is it perfect? No. Can a determined attacker find ways around it? Probably. But it's dramatically better than the alternatives: giving AI agents direct AWS credentials or blocking AI tools entirely and watching developers find workarounds.
The key insight: Security is not about building walls. It's about building gates with guards. MCP is that gate.
📌 Wrapping Up
Thank you for reading! I hope this gave you practical ideas for securing AI agent access in your environment.
Found this useful?
- ❤️ Like if it helped you think through your security approach
- 🦄 Unicorn if you're implementing this pattern
- 💾 Save for your next security review
- 🔄 Share with your security team
Follow me for more on:
- AWS security patterns
- AI/ML infrastructure
- Cloud architecture
- DevSecOps practices
💡 What's Next
I'm working on a follow-up article about monitoring and alerting for MCP deployments. Follow for updates.
Also exploring: Multi-region MCP deployments and disaster recovery patterns.
🌐 Portfolio & Work
Explore my full body of work, certifications, and architecture projects:
🛠️ Services I Offer
Looking for hands-on guidance with cloud security or AI infrastructure?
- Cloud Security Architecture (AWS / Azure)
- AI/ML Infrastructure Design
- Security Audit & Remediation
- Technical Writing & Documentation
- Architecture Reviews
- 1:1 Technical Mentorship
🤝 Let's Connect
Questions about implementing this pattern? Drop a comment or connect with me on LinkedIn.
For consulting or technical discussions: simplynadaf@gmail.com
Stay secure! 🔒