<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ismail Kovvuru</title>
    <description>The latest articles on DEV Community by Ismail Kovvuru (@ismailkovvuru).</description>
    <link>https://dev.to/ismailkovvuru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3255997%2Faed64b9d-6c57-4355-8b07-70bbcdea6603.jpg</url>
      <title>DEV Community: Ismail Kovvuru</title>
      <link>https://dev.to/ismailkovvuru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ismailkovvuru"/>
    <language>en</language>
    <item>
      <title>AWS S3 Cross-Account Uploads Failing with 403 AccessDenied</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Tue, 28 Oct 2025 16:01:32 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-s3-cross-account-uploads-failing-with-403-accessdenied-20ed</link>
      <guid>https://dev.to/ismailkovvuru/aws-s3-cross-account-uploads-failing-with-403-accessdenied-20ed</guid>
      <description>&lt;p&gt;Learn how a simple missing permission in an AWS S3 Access Point policy caused 403 AccessDenied errors in cross-account uploads, even when IAM roles and bucket policies were correct. Step-by-step fix and prevention guide inside.&lt;/p&gt;

&lt;p&gt;A user saw &lt;code&gt;500 Internal Server Error&lt;/code&gt; on uploads. Tracing showed an S3 &lt;code&gt;AccessDenied (403)&lt;/code&gt; coming from an S3 Access Point that belonged to a different AWS account.&lt;/p&gt;

&lt;p&gt;The bucket policy allowed cross-account writes, but the &lt;em&gt;Access Point policy&lt;/em&gt; did &lt;strong&gt;not&lt;/strong&gt; include the Lambda role ARN — S3 blocked the request.&lt;/p&gt;

&lt;p&gt;Fix: add the source Lambda’s role ARN to the Access Point policy (or use an alternative cross-account pattern). After the Access Point policy was updated, uploads succeeded.&lt;/p&gt;

&lt;p&gt;Below is a clear, step-by-step explanation of &lt;strong&gt;what happened&lt;/strong&gt;, &lt;strong&gt;why it happened&lt;/strong&gt;, and &lt;strong&gt;how to fix and prevent&lt;/strong&gt; it, written so that both engineers and managers can follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What, where, when, and how the problem showed up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happened (observable symptom):&lt;/strong&gt;&lt;br&gt;
User reported &lt;code&gt;File uploads to S3 are failing&lt;/code&gt;. The user-facing API returned &lt;code&gt;500 Internal Server Error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where in the system:&lt;/strong&gt;&lt;br&gt;
Uploads flow: Frontend → API Gateway → Lambda (in Account A) → S3 Access Point → S3 bucket (owned by Account B). The Access Point resource was in &lt;strong&gt;another AWS account&lt;/strong&gt; (Account B).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it happened:&lt;/strong&gt;&lt;br&gt;
At runtime when Lambda attempted to &lt;code&gt;PutObject&lt;/code&gt; through the Access Point to the destination account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it presented in traces and logs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Frontend initiated an upload request via API Gateway.&lt;/li&gt;
&lt;li&gt;API Gateway invoked a Lambda function.&lt;/li&gt;
&lt;li&gt;Lambda attempted an S3 PutObject operation.&lt;/li&gt;
&lt;li&gt;S3 returned 403 AccessDenied.&lt;/li&gt;
&lt;li&gt;The upstream API returned 500 Internal Server Error to the user, masking the actual permission failure from S3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This translation of the 403 into a 500 response made troubleshooting initially misleading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it mattered / got missed initially:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The team checked bucket health, the IAM role (which had broad permissions), and network connectivity; everything looked fine.&lt;/li&gt;
&lt;li&gt;The subtlety: S3 Access Points have their &lt;strong&gt;own resource policies&lt;/strong&gt; (separate from bucket policy). Even though the bucket policy allowed cross-account writes, the Access Point policy did &lt;strong&gt;not&lt;/strong&gt; include the Lambda role ARN as a principal — S3 denied the operation at the Access Point layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Root cause
&lt;/h2&gt;

&lt;p&gt;An S3 Access Point is a resource that can have its &lt;strong&gt;own policy&lt;/strong&gt;. When a request goes through an Access Point, S3 enforces both the bucket policy and the Access Point policy. In this case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The destination bucket’s policy allowed cross-account writes.&lt;/li&gt;
&lt;li&gt;The Access Point policy &lt;strong&gt;did not&lt;/strong&gt; allow the Lambda’s role (principal) from the source account.&lt;/li&gt;
&lt;li&gt;S3 rejected the &lt;code&gt;PutObject&lt;/code&gt; with &lt;code&gt;AccessDenied (403)&lt;/code&gt; at the Access Point layer.&lt;/li&gt;
&lt;li&gt;The Lambda (or API Gateway) didn’t translate that permission error into a meaningful client response, so the client only saw &lt;code&gt;500 Internal Server Error&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; When cross-account operations use S3 Access Points, check the Access Point policy, not just the bucket policy or the IAM role.&lt;/p&gt;
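&lt;p&gt;As a rough mental model (an illustration, not the real IAM evaluator), a cross-account &lt;code&gt;PutObject&lt;/code&gt; through an Access Point behaves like a logical AND of the two resource policies. A minimal Node.js sketch, with a hypothetical function name:&lt;/p&gt;

```javascript
// Toy model of S3 Access Point policy layering for cross-account requests:
// the request succeeds only if BOTH the bucket policy AND the Access Point
// policy allow the caller. Illustrative only; not the real IAM engine.
function crossAccountPutAllowed(bucketPolicyAllows, accessPointPolicyAllows) {
  if (bucketPolicyAllows) {
    if (accessPointPolicyAllows) {
      return "200 OK";
    }
  }
  return "403 AccessDenied";
}

// The incident in this post: bucket policy allowed, Access Point policy did not.
console.log(crossAccountPutAllowed(true, false)); // "403 AccessDenied"
console.log(crossAccountPutAllowed(true, true));  // "200 OK"
```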
&lt;h2&gt;
  
  
  3. The exact fix that worked
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix performed:&lt;/strong&gt; Update the S3 Access Point policy in the destination account (Account B) to include the source Lambda execution role ARN from Account A as an allowed principal for &lt;code&gt;s3:PutObject&lt;/code&gt; (and other relevant S3 actions).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After the policy update:&lt;/strong&gt; Uploads succeeded and the API returned &lt;code&gt;200 OK&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Concrete commands &amp;amp; policy examples
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; replace account IDs, ARNs, access point names and regions with your own.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  4.1 Inspect current Access Point policy (destination account)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3control get-access-point-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; 222233334444 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; app-uploads &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If no policy is returned, that is itself relevant: with no Access Point policy in place, nothing at that layer grants access to a cross-account principal, so requests through the Access Point will be denied.&lt;/p&gt;
&lt;h3&gt;
  
  
  4.2 Minimal Access Point policy that allows a Lambda role in another account to PutObject
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;access-point-policy.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowLambdaPutFromAccountA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::111122223333:role/lambda-exec-role"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObjectAcl"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:us-east-1:222233334444:accesspoint/app-uploads/object/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3control put-access-point-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; 222233334444 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; app-uploads &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy&lt;/span&gt; file://access-point-policy.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 Example bucket policy (destination account) that is compatible
&lt;/h3&gt;

&lt;p&gt;The bucket policy can additionally allow writes, but the Access Point policy must still explicitly allow the principal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::111122223333:role/lambda-exec-role"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-bucket/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:SourceAccount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"111122223333"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Test a PutObject using the Access Point ARN (from source account)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api put-object &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; arn:aws:s3:us-east-1:222233334444:accesspoint/app-uploads &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; test.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--body&lt;/span&gt; ./test.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Expected success: the call completes and the CLI prints the object’s &lt;code&gt;ETag&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If AccessDenied, confirm both Access Point policy and bucket policy include appropriate principals and conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Why the API showed &lt;code&gt;500&lt;/code&gt; instead of &lt;code&gt;403&lt;/code&gt; (and how to avoid masking)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Lambda got a 403 from S3, but either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda code didn't catch/translate the exception and defaulted to an internal error, or&lt;/li&gt;
&lt;li&gt;API Gateway integration mapping converted the Lambda error into a generic 500.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to avoid masking in future:&lt;/strong&gt; catch S3 errors and return meaningful HTTP status codes to clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example (Node.js Lambda) — catch and propagate S3 errors:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:s3:us-east-1:222233334444:accesspoint/app-uploads&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;file.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AccessDenied&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Upload blocked: Access denied&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unexpected S3 error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Internal Server Error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendation:&lt;/strong&gt; surface the correct HTTP code for client-facing errors, and log the full S3 error payload for troubleshooting.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. When to use this solution, and when not to
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to use: include the source role ARN in the Access Point policy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use when you &lt;strong&gt;need direct cross-account writes&lt;/strong&gt; through an Access Point (multi-tenant access patterns, VPC-restricted access, or when Access Points are an architecture requirement).&lt;/li&gt;
&lt;li&gt;Use when Access Points are used to manage fine-grained access to a large bucket by many consumers across accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained control at Access Point level.&lt;/li&gt;
&lt;li&gt;Scopes access specifically to that Access Point and object path (less blast radius than bucket policy alone).&lt;/li&gt;
&lt;li&gt;Works well with Access Point features (VPC restrictions, policy scoping).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons / Risks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires managing principals across accounts — can be error-prone if roles rotate or change names.&lt;/li&gt;
&lt;li&gt;Policies can get complex; need automation for correctness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When &lt;strong&gt;not&lt;/strong&gt; to use — alternatives
&lt;/h3&gt;

&lt;p&gt;If you don't require Access Points, or cross-account Access Point policy management is too heavy for your org, consider these alternatives:&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative A — Cross-account &lt;strong&gt;AssumeRole&lt;/strong&gt; (recommended for programmatic cross-account access)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a role in the destination account (Account B) that allows &lt;code&gt;s3:PutObject&lt;/code&gt; on the bucket. Grant Account A permission to assume that role.&lt;/li&gt;
&lt;li&gt;The Lambda in Account A calls &lt;code&gt;sts:AssumeRole&lt;/code&gt; to assume the role in Account B and calls S3 with the temporary credentials. This avoids managing resource policies that reference Account A principals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose:&lt;/strong&gt; if you prefer IAM role trust relationships and more centralized control; good for service-to-service cross-account interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sketch:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In Account B, create &lt;code&gt;S3WriteRole&lt;/code&gt; with bucket &lt;code&gt;PutObject&lt;/code&gt; permissions. Trust policy allows &lt;code&gt;arn:aws:iam::111122223333:role/lambda-exec-role&lt;/code&gt; (or the Account A principal) to assume it.&lt;/li&gt;
&lt;li&gt;In Lambda (Account A), call &lt;code&gt;sts.assumeRole&lt;/code&gt; to get temporary credentials, then call S3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; centralized role in destination; simpler to audit.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; adds STS usage and role assumption step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative B — Pre-signed URLs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Generate a pre-signed &lt;code&gt;PUT&lt;/code&gt; URL in Account B (or in Account A via a role in Account B). The frontend uses that URL to upload directly to S3. No cross-account Access Point policy is needed if the URL is signed by a principal that already has access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose:&lt;/strong&gt; when user uploads from browser/mobile and you want to avoid long-lived credentials or cross-account writes from Lambda.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; simple client flow, least privilege on server.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; signature management; cannot perform server-side transformations before upload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative C — Use bucket policies (no Access Point)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If Access Points are not required, a direct bucket policy that allows cross-account principals may be simpler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose:&lt;/strong&gt; single-tenant use, or small number of trusted accounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Diagnostics checklist / playbook (step-by-step)
&lt;/h2&gt;

&lt;p&gt;If an S3 upload fails and you see a 500 or 403, run through this checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trace the request path&lt;/strong&gt; — which resource did the client actually hit? (Direct bucket ARN, or Access Point ARN?)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Check Lambda code: what &lt;code&gt;Bucket&lt;/code&gt; value does it pass to S3? If it uses an Access Point ARN, note the account id in that ARN.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check CloudTrail&lt;/strong&gt; for S3 Data Events (PutObject) to see the exact &lt;code&gt;errorCode&lt;/code&gt;/&lt;code&gt;errorMessage&lt;/code&gt;. CloudTrail shows which principal was used and whether the error was &lt;code&gt;AccessDenied&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check Access Point policy&lt;/strong&gt; (destination account):&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   aws s3control get-access-point-policy &lt;span class="nt"&gt;--account-id&lt;/span&gt; DEST &lt;span class="nt"&gt;--name&lt;/span&gt; APNAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check bucket policy&lt;/strong&gt; (destination account) for &lt;code&gt;PutObject&lt;/code&gt; allow/deny statements and &lt;code&gt;aws:SourceAccount&lt;/code&gt; or &lt;code&gt;aws:SourceArn&lt;/code&gt; conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confirm the principal&lt;/strong&gt;: ensure the policy includes the Lambda's execution role ARN or the appropriate account principal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check Lambda IAM&lt;/strong&gt; (source account): verify Lambda has &lt;code&gt;s3:PutObject&lt;/code&gt; in its IAM permissions for the target resource (if using assumed role pattern, ensure assume role is allowed).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check VPC / endpoint restrictions&lt;/strong&gt;: Access Points can be restricted to VPCs — confirm the call originates from an allowed place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reproduce with AWS CLI&lt;/strong&gt; using the same ARN to see exact error:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   aws s3api put-object &lt;span class="nt"&gt;--bucket&lt;/span&gt; arn:aws:s3:us-east-1:DEST:accesspoint/APNAME &lt;span class="nt"&gt;--key&lt;/span&gt; t.txt &lt;span class="nt"&gt;--body&lt;/span&gt; t.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="9"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fix policy, then re-test&lt;/strong&gt;. If still failing, enable S3 Server Access Logging or review CloudTrail events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid masking:&lt;/strong&gt; update Lambda error handling to surface S3 error codes to client and log full stack trace.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  8. Prevention: automation, monitoring &amp;amp; best practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automated checks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add CI/CD checks that verify Access Point policies and bucket policies include the required principals for cross-account flows, e.g. run &lt;code&gt;aws s3control get-access-point-policy&lt;/code&gt; in a test job and compare the result against the expected principals.&lt;/li&gt;
&lt;li&gt;Use infrastructure as code (Terraform/CloudFormation) for Access Points and policies so cross-account principals are reviewed and versioned.&lt;/li&gt;
&lt;/ul&gt;
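&lt;p&gt;For example, a CI job could pipe the output of &lt;code&gt;aws s3control get-access-point-policy&lt;/code&gt; into a small checker. Below is a hypothetical Node.js helper (the function name is made up for this sketch; the policy shown is the example from section 4.2):&lt;/p&gt;

```javascript
// Hypothetical CI helper: verify a policy document contains an Allow
// statement naming the expected principal ARN and action.
function policyAllowsPrincipal(policyDoc, roleArn, action) {
  const statements = [].concat(policyDoc.Statement || []);
  return statements.some(function (stmt) {
    if (stmt.Effect !== "Allow") { return false; }
    const principals = [].concat((stmt.Principal || {}).AWS || []);
    const actions = [].concat(stmt.Action || []);
    if (principals.indexOf(roleArn) === -1) { return false; }
    return actions.indexOf(action) !== -1;
  });
}

// Example input: the minimal Access Point policy from section 4.2.
const policy = {
  Version: "2012-10-17",
  Statement: [{
    Sid: "AllowLambdaPutFromAccountA",
    Effect: "Allow",
    Principal: { AWS: "arn:aws:iam::111122223333:role/lambda-exec-role" },
    Action: ["s3:PutObject", "s3:PutObjectAcl"],
    Resource: "arn:aws:s3:us-east-1:222233334444:accesspoint/app-uploads/object/*"
  }]
};

console.log(policyAllowsPrincipal(policy,
  "arn:aws:iam::111122223333:role/lambda-exec-role", "s3:PutObject")); // true
console.log(policyAllowsPrincipal(policy,
  "arn:aws:iam::999988887777:role/other-role", "s3:PutObject"));       // false
```

&lt;p&gt;Failing the pipeline when this check returns &lt;code&gt;false&lt;/code&gt; catches a missing principal before deployment rather than at runtime.&lt;/p&gt;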

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable S3 Data Events in CloudTrail for sensitive buckets — this records &lt;code&gt;PutObject&lt;/code&gt; and will show &lt;code&gt;AccessDenied&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Enable S3 Server Access Logging on the bucket for additional forensic data.&lt;/li&gt;
&lt;li&gt;Log Lambda exceptions and include error codes; make them searchable in CloudWatch Logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Policy hygiene&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer least privilege: grant only the actions required (&lt;code&gt;s3:PutObject&lt;/code&gt;, possibly &lt;code&gt;s3:PutObjectAcl&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Where possible use &lt;code&gt;Condition&lt;/code&gt; elements (&lt;code&gt;aws:SourceAccount&lt;/code&gt;, &lt;code&gt;aws:SourceArn&lt;/code&gt;) to reduce risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add integration tests that perform a real PutObject via the Access Point as part of deployment pipelines (run under a sandbox account).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9. Example: the assume-role alternative (step by step)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. In destination account (Account B)&lt;/strong&gt; create role &lt;code&gt;CrossAccountS3Writer&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trust policy grants &lt;code&gt;sts:AssumeRole&lt;/code&gt; to the source account (Account A) or the Lambda role.&lt;/li&gt;
&lt;li&gt;Permissions: &lt;code&gt;s3:PutObject&lt;/code&gt; on &lt;code&gt;arn:aws:s3:::my-bucket/*&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
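&lt;p&gt;For illustration, the trust policy on &lt;code&gt;CrossAccountS3Writer&lt;/code&gt; might look like the following (the account ID and role name are placeholders, matching the examples above):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-exec-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```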

&lt;p&gt;&lt;strong&gt;2. In source account (Lambda)&lt;/strong&gt; use STS to assume &lt;code&gt;CrossAccountS3Writer&lt;/code&gt; and call S3 using those temporary credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; avoids scattering destination resource policies referencing many source principals; centralizes control in the destination account.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Short checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Identify if using Access Point ARN or direct bucket ARN.&lt;/li&gt;
&lt;li&gt;Check the Access Point policy in the destination account.&lt;/li&gt;
&lt;li&gt;Check the bucket policy for &lt;code&gt;PutObject&lt;/code&gt; allow/deny.&lt;/li&gt;
&lt;li&gt;Confirm the principal ARN (Lambda role) is included in the Access Point/bucket policy, or that role assumption is configured.&lt;/li&gt;
&lt;li&gt;Reproduce with the AWS CLI.&lt;/li&gt;
&lt;li&gt;Fix the policy or implement assume-role, then retest.&lt;/li&gt;
&lt;li&gt;Add automated tests &amp;amp; CI checks.&lt;/li&gt;
&lt;li&gt;Improve error handling so an S3 403 becomes a client 403, not a 500.&lt;/li&gt;
&lt;li&gt;Document in the team KB with example policies.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  11. Recommendations (practical &amp;amp; actionable)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short term:&lt;/strong&gt; Fix the Access Point policy to include the Lambda role ARN and re-test. Update Lambda to surface 403s clearly. Add a short KB entry referencing this incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medium term:&lt;/strong&gt; Decide a cross-account access pattern for your org (Access Point policy vs assume-role vs presigned URLs). Standardize it and codify with IaC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long term:&lt;/strong&gt; Automate policy verification in CI, enable CloudTrail S3 data events for critical buckets, and add integration test coverage that performs a test upload through the exact path used in production.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
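&lt;p&gt;The long-term CI check could begin as a plain verification that the Access Point policy document (fetched, for example, with the &lt;code&gt;s3control&lt;/code&gt; API's &lt;code&gt;get_access_point_policy&lt;/code&gt;) names the expected principal. A minimal, illustrative sketch; real policies may also use conditions or wildcards this does not cover:&lt;/p&gt;

```python
import json

def policy_allows_principal(policy_json, role_arn):
    """Return True if any Allow statement in the policy document
    explicitly names the given role ARN as an AWS principal."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear as a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        if isinstance(aws, str):
            aws = [aws]
        if role_arn in aws:
            return True
    return False
```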

&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;This was a classic example of &lt;strong&gt;policy layering&lt;/strong&gt; catching teams off guard: even with a permissive bucket policy and a correctly configured IAM role in the caller account, the Access Point — being a separate resource — enforces its own policy. The symptom (500) masked the true permission error (403) until every hop was traced and the Access Point policy was checked.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>api</category>
      <category>discuss</category>
      <category>devops</category>
    </item>
    <item>
      <title>OpenAI Launches ChatGPT Go in India 🇮🇳</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Tue, 19 Aug 2025 09:24:47 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/openai-launches-chatgpt-go-in-india-1ggh</link>
      <guid>https://dev.to/ismailkovvuru/openai-launches-chatgpt-go-in-india-1ggh</guid>
      <description>&lt;p&gt;&lt;em&gt;OpenAI launches ChatGPT Go in India — a new affordable subscription plan with 10x more messages, images, and uploads, plus UPI payments. Available now for just ₹399.&lt;/em&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing ChatGPT Go in India
&lt;/h2&gt;

&lt;p&gt;OpenAI is excited to announce the launch of &lt;strong&gt;ChatGPT Go&lt;/strong&gt; in India, a brand-new subscription tier designed to make advanced AI features more accessible and affordable.  &lt;/p&gt;

&lt;p&gt;Nick Turley, Head of ChatGPT, announced ChatGPT Go’s launch in India on LinkedIn, highlighting OpenAI’s commitment to making advanced AI more affordable and accessible. Here is what he shared:&lt;/p&gt;

&lt;p&gt;India is one of the fastest-growing communities of ChatGPT users in the world. From students and professionals to creators and entrepreneurs, people are using ChatGPT every day to &lt;strong&gt;learn, write, design, and build&lt;/strong&gt;. One of the top requests from our Indian users has been: &lt;em&gt;“Can you make premium ChatGPT features more affordable and easier to subscribe to locally?”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;ChatGPT Go&lt;/strong&gt;, we are delivering on that request.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What ChatGPT Go Offers
&lt;/h2&gt;

&lt;p&gt;Subscribers of ChatGPT Go in India can now enjoy major enhancements compared with the free tier:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10x higher message limits&lt;/strong&gt; – have extended conversations for deeper learning and projects.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10x more image generations&lt;/strong&gt; – create visuals, mockups, and ideas without worrying about running out.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10x more file uploads&lt;/strong&gt; – work seamlessly with larger and more complex documents.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2x longer memory&lt;/strong&gt; – experience smarter, more personalized interactions, with ChatGPT remembering more across sessions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And all of this is available at just &lt;strong&gt;₹399 per month&lt;/strong&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Easier Subscriptions with INR and UPI
&lt;/h2&gt;

&lt;p&gt;To make the process seamless, we’ve localized ChatGPT subscriptions in India:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All plans are now priced in &lt;strong&gt;Indian Rupees (INR)&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Payments can be made conveniently through &lt;strong&gt;UPI&lt;/strong&gt;, India’s most widely used digital payment system.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that upgrading is as simple as sending a UPI payment — no international barriers, no complexity.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Chose India First
&lt;/h2&gt;

&lt;p&gt;We’re launching ChatGPT Go in India first because of the country’s vibrant AI adoption. Millions of people here use ChatGPT for &lt;strong&gt;learning, productivity, and creativity&lt;/strong&gt;, and affordability has been one of the most requested improvements.  &lt;/p&gt;

&lt;p&gt;Starting in India allows us to learn directly from this growing community before expanding the plan to other countries.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Our Commitment
&lt;/h2&gt;

&lt;p&gt;OpenAI’s mission is to make AI &lt;strong&gt;useful, safe, and accessible&lt;/strong&gt; to as many people as possible. The launch of ChatGPT Go in India is part of that vision, giving more power and flexibility to users who want to do more with AI — at a price that works locally.  &lt;/p&gt;

&lt;p&gt;We look forward to hearing from our Indian community, learning from feedback, and continuing to improve ChatGPT for everyone.  &lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>discuss</category>
      <category>ai</category>
    </item>
    <item>
      <title>AWS Lambda Response Streaming Now Supports 200 MB Payloads — 10× More!</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Tue, 12 Aug 2025 08:25:41 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-lambda-response-streaming-now-supports-200-mb-payloads-10x-more-2o0o</link>
      <guid>https://dev.to/ismailkovvuru/aws-lambda-response-streaming-now-supports-200-mb-payloads-10x-more-2o0o</guid>
      <description>&lt;p&gt;&lt;strong&gt;Big news for serverless developers&lt;/strong&gt; — AWS Lambda’s &lt;strong&gt;response streaming limit&lt;/strong&gt; has jumped from &lt;strong&gt;20 MB → 200 MB&lt;/strong&gt; by default.&lt;br&gt;&lt;br&gt;
That’s a &lt;strong&gt;10× improvement&lt;/strong&gt;, and it means &lt;em&gt;fewer workarounds, less infrastructure, and simpler code&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed?
&lt;/h2&gt;

&lt;p&gt;Previously, if your Lambda needed to return more than 20 MB (e.g., PDFs, images, analytics dumps, AI output), you had to:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split or chunk the payload,
&lt;/li&gt;
&lt;li&gt;Compress it, or
&lt;/li&gt;
&lt;li&gt;Upload it to &lt;strong&gt;S3&lt;/strong&gt; and return a presigned URL.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That added &lt;em&gt;more code, higher latency, and extra moving parts&lt;/em&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You can stream up to &lt;strong&gt;200 MB directly&lt;/strong&gt; from Lambda to your client.&lt;br&gt;&lt;br&gt;
No S3 handoff. No chunking. Just stream data as it’s ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Facts
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limit Type&lt;/th&gt;
&lt;th&gt;New (Streaming)&lt;/th&gt;
&lt;th&gt;Old&lt;/th&gt;
&lt;th&gt;Applies To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response Size (Streaming)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;200 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 MB&lt;/td&gt;
&lt;td&gt;Node.js managed &amp;amp; custom runtimes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response Size (Classic)&lt;/td&gt;
&lt;td&gt;6 MB&lt;/td&gt;
&lt;td&gt;6 MB&lt;/td&gt;
&lt;td&gt;Buffered responses only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Payload Size&lt;/td&gt;
&lt;td&gt;6 MB&lt;/td&gt;
&lt;td&gt;6 MB&lt;/td&gt;
&lt;td&gt;Still capped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;strong&gt;6 MB input limit&lt;/strong&gt; still exists for requests into Lambda.&lt;br&gt;&lt;br&gt;
For uploads, use S3 presigned URLs or direct integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This change simplifies architectures for:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Media Delivery&lt;/strong&gt; — Serve full podcast episodes, videos, and big PDFs right from Lambda.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generative AI&lt;/strong&gt; — Stream long text, images, or audio without waiting for completion.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data-heavy APIs&lt;/strong&gt; — Deliver large CSVs, reports, or datasets without extra handling.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because it’s &lt;strong&gt;streaming&lt;/strong&gt;, users start receiving data immediately — improving &lt;em&gt;Time-to-First-Byte (TTFB)&lt;/em&gt; and responsiveness.&lt;/p&gt;
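&lt;p&gt;On the client side, that benefit shows up in how the response is consumed. A small stdlib-only Python sketch that processes bytes as they arrive instead of waiting for the full payload:&lt;/p&gt;

```python
from urllib.request import urlopen

def stream_to_file(url, path, chunk_size=64 * 1024):
    """Read an HTTP response incrementally and write it to disk.

    Each chunk is handled as soon as it arrives, so time-to-first-byte
    is independent of the total payload size."""
    total = 0
    with urlopen(url) as resp, open(path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total
```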

&lt;h2&gt;
  
  
  Developer Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Streaming When:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responses are large and sequential.&lt;/li&gt;
&lt;li&gt;Processing produces output progressively (AI or analytics).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mind the Input Limit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For uploads &amp;gt; 6 MB, use S3 presigned URLs or an API Gateway → S3 integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Runtime Support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works out-of-the-box in &lt;strong&gt;Node.js managed runtimes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Other languages require a &lt;strong&gt;custom runtime&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  In Summary
&lt;/h2&gt;

&lt;p&gt;AWS Lambda’s new &lt;strong&gt;200 MB response streaming&lt;/strong&gt; means:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less code
&lt;/li&gt;
&lt;li&gt;Lower latency
&lt;/li&gt;
&lt;li&gt;Fewer AWS services to manage
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most big-response use cases, Lambda can now be a &lt;strong&gt;direct delivery engine&lt;/strong&gt; — no buckets, no detours. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Combine with CloudFront or edge-optimized APIs for &lt;em&gt;blazing global performance&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>lambda</category>
      <category>discuss</category>
    </item>
    <item>
      <title>AWS Lambda Now Supports Avro for Kafka – No More Manual Deserialization</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Thu, 31 Jul 2025 04:11:42 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-lambda-now-supports-avro-for-kafka-no-more-manual-deserialization-3kpn</link>
      <guid>https://dev.to/ismailkovvuru/aws-lambda-now-supports-avro-for-kafka-no-more-manual-deserialization-3kpn</guid>
      <description>&lt;p&gt;AWS Lambda now natively supports &lt;strong&gt;Avro&lt;/strong&gt; serialization when consuming messages from &lt;strong&gt;Kafka (MSK or self-managed)&lt;/strong&gt;. No more custom deserialization, external Avro libraries, or complex schema plumbing — just plug in your schema and go.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s New?
&lt;/h3&gt;

&lt;p&gt;AWS has introduced &lt;strong&gt;built-in Avro support&lt;/strong&gt; for Lambda when triggered by &lt;strong&gt;Kafka&lt;/strong&gt; topics. Previously, using Avro meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing &lt;strong&gt;custom decoding code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Including &lt;strong&gt;&lt;code&gt;avro-python3&lt;/code&gt;, &lt;code&gt;fastavro&lt;/code&gt;, or Java Avro libs&lt;/strong&gt; in your Lambda package&lt;/li&gt;
&lt;li&gt;Handling &lt;strong&gt;schema registry integration&lt;/strong&gt; manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower cold starts&lt;/strong&gt; due to large deployment bundles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, Lambda does all this &lt;strong&gt;automatically&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just point it at your Kafka topic&lt;/li&gt;
&lt;li&gt;Provide a schema reference (e.g., AWS Glue Schema Registry)&lt;/li&gt;
&lt;li&gt;Lambda will decode the Avro-encoded payload &lt;strong&gt;natively&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Setup (Minimal Configuration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"eventSource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"avroSchema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:us-east-1:123456789012:schema/TelemetryEvent"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lambda Code (Python Example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decoded payload:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Avro already parsed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Is a Big Deal
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benefits for DevOps &amp;amp; Engineers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old Workflow&lt;/th&gt;
&lt;th&gt;New Native Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Custom Avro decoding logic&lt;/td&gt;
&lt;td&gt;Built-in decoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extra library dependencies&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bigger Lambda package&lt;/td&gt;
&lt;td&gt;Smaller, faster function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual schema registry calls&lt;/td&gt;
&lt;td&gt;Automatic schema parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Real-World Use Cases
&lt;/h3&gt;

&lt;p&gt;This update is &lt;strong&gt;critical for teams&lt;/strong&gt; dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume &lt;strong&gt;real-time Kafka pipelines&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT event ingestion&lt;/strong&gt; and telemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial systems&lt;/strong&gt; using Avro schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices&lt;/strong&gt; with schema-first contracts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Supported Schema Registries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue Schema Registry&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confluent Schema Registry&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning&lt;/li&gt;
&lt;li&gt;Compatibility (backward, forward, full)&lt;/li&gt;
&lt;li&gt;Schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So your Lambda function always works with the latest compatible schema version.&lt;/p&gt;
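&lt;p&gt;As an illustration (not the exact console flow), registering such a schema in the AWS Glue Schema Registry with boto3 could start from a request like the one below, passed to &lt;code&gt;glue.create_schema(**kwargs)&lt;/code&gt;. The registry name and record fields are assumptions matching the article's &lt;code&gt;TelemetryEvent&lt;/code&gt; example:&lt;/p&gt;

```python
import json

def avro_schema_request(registry, name, schema_dict, compatibility="BACKWARD"):
    """Build the keyword arguments for boto3's glue.create_schema call
    for an Avro-formatted schema (registry/name values are illustrative)."""
    return {
        "RegistryId": {"RegistryName": registry},
        "SchemaName": name,
        "DataFormat": "AVRO",
        "Compatibility": compatibility,
        "SchemaDefinition": json.dumps(schema_dict),
    }

# An illustrative telemetry record schema:
TELEMETRY_SCHEMA = {
    "type": "record",
    "name": "TelemetryEvent",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "temperature", "type": "double"},
    ],
}

# glue = boto3.client("glue")
# glue.create_schema(**avro_schema_request("telemetry", "TelemetryEvent", TELEMETRY_SCHEMA))
```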

&lt;h3&gt;
  
  
  Caveats &amp;amp; Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Messages &lt;strong&gt;must be Avro-encoded&lt;/strong&gt; and registered in a supported registry.&lt;/li&gt;
&lt;li&gt;Stick to &lt;strong&gt;schema compatibility rules&lt;/strong&gt; to avoid decoding failures.&lt;/li&gt;
&lt;li&gt;This feature is &lt;strong&gt;specific to Kafka triggers&lt;/strong&gt; — not available for SQS, Kinesis, or DynamoDB streams.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Official AWS Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-msk.html" rel="noopener noreferrer"&gt;Lambda + Kafka Event Source Mapping Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html" rel="noopener noreferrer"&gt;AWS Glue Schema Registry Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/platform/current/schema-registry/" rel="noopener noreferrer"&gt;Confluent Schema Registry (for hybrid setups)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/compute/introducing-aws-lambda-native-support-for-avro-and-protobuf-formatted-apache-kafka-events/" rel="noopener noreferrer"&gt;AWS Lambda native support for Avro&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Thoughts
&lt;/h2&gt;

&lt;p&gt;This is a &lt;strong&gt;low-profile but high-impact update&lt;/strong&gt; from AWS. For teams working with &lt;strong&gt;real-time, schema-driven data flows&lt;/strong&gt;, this makes Lambda &lt;strong&gt;more production-ready&lt;/strong&gt; and &lt;strong&gt;DevOps-friendly&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kafka</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes App Slow? Fix DNS, Mesh &amp; Caching, Not Node Scaling</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Wed, 30 Jul 2025 03:10:40 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/kubernetes-app-slow-fix-dns-mesh-cachingnot-node-scaling-4amj</link>
      <guid>https://dev.to/ismailkovvuru/kubernetes-app-slow-fix-dns-mesh-cachingnot-node-scaling-4amj</guid>
      <description>&lt;p&gt;A production Kubernetes application started showing latency issues during peak hours. User reports flagged slow page loads and inconsistent response times.&lt;/p&gt;

&lt;p&gt;The infrastructure team’s initial reaction was to add more nodes to the cluster. But throwing compute at a latency issue is inefficient and costly, so before provisioning additional resources a deeper inspection was performed.&lt;/p&gt;
&lt;h2&gt;
  
  
  Root Causes Identified:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Too many service hops&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CoreDNS misconfigurations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No caching for repeated API calls&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Real Solutions (Not More Nodes)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Use a Service Mesh&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;br&gt;
Service meshes like &lt;strong&gt;Istio&lt;/strong&gt; or &lt;strong&gt;Linkerd&lt;/strong&gt; reduce latency by enabling intelligent routing, retries, timeouts, and circuit breaking — optimizing pod-to-pod communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commands (Istio example):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Istio&lt;/span&gt;
istioctl &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;# Enable automatic sidecar injection&lt;/span&gt;
kubectl label namespace default istio-injection&lt;span class="o"&gt;=&lt;/span&gt;enabled

&lt;span class="c"&gt;# Deploy your app with mesh support&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; your-app-deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Fix CoreDNS Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;br&gt;
Misconfigured CoreDNS leads to excessive lookups, especially if &lt;code&gt;upstream&lt;/code&gt;/&lt;code&gt;loop&lt;/code&gt; plugins are misused or timeouts are high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inspect CoreDNS logs:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Edit CoreDNS ConfigMap:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl edit configmap coredns &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Optimizations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set appropriate TTLs:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  cache 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Minimize &lt;code&gt;forward&lt;/code&gt; retries:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  forward . /etc/resolv.conf {
    max_concurrent 1000
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Add Caching for Repeated API Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;br&gt;
If microservices make repeated calls to the same APIs (e.g., auth, config, pricing), caching avoids redundant processing and DNS lookups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-process memory cache (&lt;code&gt;LRU&lt;/code&gt;) or an external cache such as &lt;code&gt;Redis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Sidecar caching with tools like &lt;strong&gt;Varnish&lt;/strong&gt; or &lt;strong&gt;NGINX&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example using Redis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python Flask example
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StrictRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/get-price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_price&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_price_from_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Not Add Nodes?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Slowness here is due to &lt;strong&gt;latency&lt;/strong&gt;, not &lt;strong&gt;resource exhaustion&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Adding nodes increases cost without resolving the actual bottlenecks.&lt;/li&gt;
&lt;li&gt;Smart tuning of networking and caching brings greater results for less overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Only These Solutions?
&lt;/h2&gt;

&lt;p&gt;These three changes gave maximum impact with minimal cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Issue                     | Solution       | Reason Chosen                           |
| ------------------------- | -------------- | --------------------------------------- |
| Excessive pod-to-pod hops | Service Mesh   | Centralized control + efficient routing |
| DNS resolution delays     | CoreDNS tuning | Reduced lookup overhead                 |
| Repeated API calls        | API Caching    | Faster responses + reduced backend load |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Are There Better Alternatives?
&lt;/h2&gt;

&lt;p&gt;Other options like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upgrading to &lt;strong&gt;Cilium&lt;/strong&gt; for eBPF-based networking.&lt;/li&gt;
&lt;li&gt;Using &lt;strong&gt;Headless Services&lt;/strong&gt; to bypass kube-proxy.&lt;/li&gt;
&lt;li&gt;Tuning &lt;strong&gt;kube-proxy&lt;/strong&gt; to reduce &lt;code&gt;iptables&lt;/code&gt; hops.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those are deeper infra-level changes. For most real-world apps, the &lt;strong&gt;mesh + DNS fix + caching&lt;/strong&gt; strategy solves 80% of latency complaints &lt;strong&gt;without scaling costs&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Always Measure Before Scaling
&lt;/h2&gt;

&lt;p&gt;Before scaling compute nodes, check usage metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pods --all-namespaces

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Before scaling your Kubernetes cluster, &lt;strong&gt;optimize what you already have&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service mesh for communication efficiency&lt;/li&gt;
&lt;li&gt;CoreDNS tuning to reduce DNS latency&lt;/li&gt;
&lt;li&gt;Caching to eliminate repetitive calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are &lt;strong&gt;network-aware, cost-effective, and production-ready&lt;/strong&gt; solutions that bring measurable performance improvements.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>networking</category>
      <category>cloud</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Docs MCP Server for DevOps Assistance</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Tue, 29 Jul 2025 05:00:41 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-docs-mcp-server-for-devops-assistance-1bo4</link>
      <guid>https://dev.to/ismailkovvuru/aws-docs-mcp-server-for-devops-assistance-1bo4</guid>
      <description>&lt;h2&gt;
  
  
  Using AWS Docs MCP Server for Accurate DevOps Assistance
&lt;/h2&gt;

&lt;p&gt;As DevOps engineers, we constantly search for &lt;strong&gt;precise AWS documentation&lt;/strong&gt; while configuring services like EC2, IAM, or Lambda. However, general-purpose AI tools often hallucinate, and navigating docs manually slows us down.&lt;/p&gt;

&lt;p&gt;Enter the &lt;strong&gt;AWS Docs MCP Server&lt;/strong&gt; — a minimal, reliable tool from &lt;strong&gt;AWS Labs&lt;/strong&gt; that lets you query documentation locally with &lt;strong&gt;zero hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether you're integrating it with &lt;strong&gt;Amazon Q&lt;/strong&gt;, using it inside &lt;strong&gt;VS Code&lt;/strong&gt;, or embedding it into CLI tooling, this server returns &lt;strong&gt;direct responses from official AWS docs&lt;/strong&gt;, fast and accurately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is AWS Docs MCP Server?
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP) server&lt;/strong&gt; for AWS documentation is an open-source server provided by AWS Labs. It's designed to return &lt;strong&gt;real-time AWS documentation&lt;/strong&gt; in response to structured requests.&lt;/p&gt;

&lt;p&gt;It’s especially useful when paired with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Amazon Q Developer Agent&lt;/strong&gt; in VS Code&lt;/li&gt;
&lt;li&gt;Custom CLI tools&lt;/li&gt;
&lt;li&gt;IDE integrations or doc-bots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of relying on generative models, it gives back only what's in the AWS documentation — &lt;strong&gt;no more guessing&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set It Up
&lt;/h2&gt;

&lt;p&gt;You can configure the server using this JSON block (used in tools like Amazon Q):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"awslabs.aws-documentation-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"awslabs.aws-documentation-mcp-server@latest"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"FASTMCP_LOG_LEVEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"AWS_DOCUMENTATION_PARTITION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"disabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"autoApprove"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What This Configuration Does
&lt;/h2&gt;

&lt;p&gt;Let’s break it down:&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does:
&lt;/h3&gt;

&lt;p&gt;This configuration tells your system (or a dev assistant like Amazon Q) to &lt;strong&gt;connect to the AWS Docs MCP Server&lt;/strong&gt;, which answers queries using &lt;strong&gt;real AWS documentation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why we're doing it:
&lt;/h3&gt;

&lt;p&gt;Because we want &lt;strong&gt;factual, fast, non-AI-generated answers&lt;/strong&gt; when we ask questions about AWS services — especially for DevOps use cases like configuring IAM policies, EC2 launch templates, or CloudFormation syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  What output you’ll get:
&lt;/h3&gt;

&lt;p&gt;Once active, this server returns structured responses like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The exact syntax or example for a specific AWS CLI command&lt;/li&gt;
&lt;li&gt;JSON or YAML config snippets from AWS docs&lt;/li&gt;
&lt;li&gt;Official links and metadata from AWS documentation&lt;/li&gt;
&lt;li&gt;Answers scoped only to &lt;strong&gt;actual AWS services&lt;/strong&gt;, nothing made up&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Explanation of Each Field
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What it Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;command&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the MCP server with &lt;code&gt;uvx&lt;/code&gt;, the tool runner that ships with the Python package manager &lt;code&gt;uv&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;args&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Downloads and runs the latest version of the AWS Docs MCP Server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;env.FASTMCP_LOG_LEVEL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sets log level to &lt;code&gt;ERROR&lt;/code&gt; (suppress warnings/info logs).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;env.AWS_DOCUMENTATION_PARTITION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specifies that it should fetch documentation only for the public AWS partition.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When &lt;code&gt;false&lt;/code&gt;, keeps the server active.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;autoApprove&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Used to control whether requests are auto-approved (leave empty for manual).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
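&lt;p&gt;If you manage assistant config files from a script, the same block can be merged in programmatically. This is a minimal sketch; the config file location varies by tool, so the path is taken as a parameter:&lt;/p&gt;

```python
import json
import os

# The server entry from the JSON block above.
SERVER_BLOCK = {
    "awslabs.aws-documentation-mcp-server": {
        "command": "uvx",
        "args": ["awslabs.aws-documentation-mcp-server@latest"],
        "env": {
            "FASTMCP_LOG_LEVEL": "ERROR",
            "AWS_DOCUMENTATION_PARTITION": "aws",
        },
        "disabled": False,
        "autoApprove": [],
    }
}

def merge_mcp_config(path):
    """Insert or refresh the server entry under "mcpServers" in a JSON config file."""
    config = {}
    if os.path.exists(path):
        with open(path) as f:
            config = json.load(f)
    config.setdefault("mcpServers", {}).update(SERVER_BLOCK)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

&lt;p&gt;Calling &lt;code&gt;merge_mcp_config&lt;/code&gt; is idempotent, so rerunning it simply refreshes the entry without disturbing other servers in the file.&lt;/p&gt;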

&lt;h2&gt;
  
  
  Why It Matters for DevOps
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No hallucination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is pulled directly from AWS docs — no assumptions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight server optimized for local/dev workflows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pluggable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easily integrates with IDEs, terminals, or dev assistants like Amazon Q.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expandable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Labs also provides MCPs for other services (like DynamoDB, CloudWatch, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Glossary of Terms Used
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model Context Protocol, an open standard that lets AI assistants and dev tools query structured context servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon Q&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS’s AI-powered coding assistant (like Copilot or ChatGPT but AWS-specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;uvx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A command-line runner from the Python package manager &lt;code&gt;uv&lt;/code&gt; that downloads and executes tools such as MCP servers in isolated environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Refers to the AWS partition (e.g., &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;aws-cn&lt;/code&gt;, &lt;code&gt;aws-us-gov&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FASTMCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastMCP, the lightweight Python framework used to build these MCP servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hallucination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When AI generates false or made-up information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;autoApprove&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controls whether the MCP server’s tool requests are approved automatically or require manual confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Labs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS’s GitHub organization for experimental/open-source tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Quick Start Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Example Usage with Amazon Q: &lt;a href="https://awslabs.github.io/mcp/servers/aws-documentation-mcp-server/" rel="noopener noreferrer"&gt;AWS MCP Servers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Thoughts
&lt;/h2&gt;

&lt;p&gt;If you're building or maintaining AWS infrastructure, you &lt;strong&gt;need reliable answers fast&lt;/strong&gt;. This tool gives you that — straight from the source.&lt;/p&gt;

&lt;p&gt;Whether you're writing CloudFormation, troubleshooting S3 policies, or scripting with Boto3, the AWS Docs MCP Server becomes a &lt;strong&gt;trusted backend&lt;/strong&gt; that supercharges your DevOps workflows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tried it out?&lt;/strong&gt; Share your integrations or feedback in the comments!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Using Amazon EFS with AWS Lambda: Persistent File Storage</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Sun, 27 Jul 2025 16:07:32 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/using-amazon-efs-with-aws-lambda-persistent-file-storage-7i7</link>
      <guid>https://dev.to/ismailkovvuru/using-amazon-efs-with-aws-lambda-persistent-file-storage-7i7</guid>
      <description>&lt;p&gt;Unlock persistent, low-latency storage in AWS Lambda using Amazon EFS. This 2025-ready guide covers real-world use cases, step-by-step Terraform and CloudFormation examples, performance tuning, cost comparisons (vs S3, /tmp, Elasticache), and DevOps best practices for scalable serverless architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Lambda + EFS: Scalable File Storage for Serverless Workloads
&lt;/h2&gt;

&lt;p&gt;When we think of AWS Lambda, we often imagine stateless, short-lived functions with tight constraints on storage and memory. But what if your function needs to read or write persistent data across multiple invocations? Enter &lt;strong&gt;Amazon EFS (Elastic File System)&lt;/strong&gt; — AWS’s fully managed NFS solution that can be mounted directly to your Lambda functions within a VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means in Simple Terms
&lt;/h2&gt;

&lt;p&gt;By mounting EFS to your Lambda function, you unlock:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Persistent storage&lt;/strong&gt; — survives between invocations&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Shared access&lt;/strong&gt; — multiple Lambdas, containers, and EC2s can access the same filesystem&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Large file support&lt;/strong&gt; — process GB-level datasets, ML models, PDFs, images, and more&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Faster ML inference or image manipulation&lt;/strong&gt; — with mounted models or binaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why /tmp Isn’t Enough in Lambda
&lt;/h2&gt;

&lt;p&gt;Lambda’s /tmp directory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is ephemeral — data is wiped once the execution environment is reclaimed&lt;/li&gt;
&lt;li&gt;Is limited to 512MB by default (ephemeral storage can be raised to 10GB, at extra cost)&lt;/li&gt;
&lt;li&gt;Cannot be shared across Lambda instances or invocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application needs persistent storage, multi-function collaboration, or handling large files, /tmp becomes a serious bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Amazon EFS?&lt;/strong&gt;&lt;br&gt;
Amazon EFS is a fully managed, elastic, network file system accessible from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2&lt;/li&gt;
&lt;li&gt;ECS/Fargate&lt;/li&gt;
&lt;li&gt;Lambda (via VPC &amp;amp; access point)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With EFS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can store unlimited files with standard POSIX permissions&lt;/li&gt;
&lt;li&gt;Mount the same filesystem across services&lt;/li&gt;
&lt;li&gt;Pay only for what you use (GB/month + I/O if using provisioned mode)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why Use Amazon EFS with Lambda?
&lt;/h2&gt;

&lt;p&gt;EFS makes serverless Lambda functions stateful and collaborative. Key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent Storage – Data stays even after Lambda shuts down&lt;/li&gt;
&lt;li&gt;Large File Support – No 512MB cap like /tmp&lt;/li&gt;
&lt;li&gt;Shared Access – Share data across multiple Lambda functions and invocations&lt;/li&gt;
&lt;li&gt;Zero Manual Scaling – Automatically grows with usage&lt;/li&gt;
&lt;li&gt;POSIX File Permissions – Secure multi-tenant file access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cost Comparison: EFS vs S3 vs /tmp vs ElastiCache
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Type&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EFS (Standard)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$0.30/GB/month + I/O (if provisioned throughput is enabled)&lt;/td&gt;
&lt;td&gt;Scalable, shared, persistent, POSIX&lt;/td&gt;
&lt;td&gt;Costlier than S3; requires VPC networking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$0.023/GB/month&lt;/td&gt;
&lt;td&gt;Durable, cheap, static hosting&lt;/td&gt;
&lt;td&gt;Not mountable as a filesystem; Lambda must go through the SDK/API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda &lt;code&gt;/tmp&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free up to 512MB&lt;/td&gt;
&lt;td&gt;Fastest, local&lt;/td&gt;
&lt;td&gt;Ephemeral, size-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElastiCache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$0.02/GB/hour&lt;/td&gt;
&lt;td&gt;Low-latency, real-time caching&lt;/td&gt;
&lt;td&gt;In-memory only, not persistent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
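&lt;p&gt;As a quick sanity check, the approximate rates in the table translate into a simple back-of-the-envelope calculation (these are the table’s estimates, not live AWS pricing):&lt;/p&gt;

```python
# Back-of-the-envelope monthly storage cost, using the table's
# approximate rates (verify against the current AWS pricing pages).
EFS_PER_GB_MONTH = 0.30    # EFS Standard, ~$/GB-month
S3_PER_GB_MONTH = 0.023    # S3 Standard, ~$/GB-month

def monthly_storage_cost(gb: float, rate_per_gb: float) -> float:
    """Storage-only cost; excludes I/O, requests, and data transfer."""
    return round(gb * rate_per_gb, 2)

if __name__ == "__main__":
    for gb in (10, 100):
        print(f"{gb} GB -> EFS ${monthly_storage_cost(gb, EFS_PER_GB_MONTH)}, "
              f"S3 ${monthly_storage_cost(gb, S3_PER_GB_MONTH)}")
```

&lt;p&gt;At 100GB that works out to roughly $30/month on EFS versus about $2.30 on S3, which is why EFS is best reserved for workloads that genuinely need a shared POSIX filesystem.&lt;/p&gt;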
&lt;h2&gt;
  
  
  How to Use EFS with Lambda – Step-by-Step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing Lambda function&lt;/li&gt;
&lt;li&gt;VPC with private subnets&lt;/li&gt;
&lt;li&gt;EFS in the same region and VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setup Steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create EFS File System&lt;/li&gt;
&lt;li&gt;Create EFS Access Point&lt;/li&gt;
&lt;li&gt;Configure Mount Targets (across AZs)&lt;/li&gt;
&lt;li&gt;Update Security Groups (allow NFS from Lambda to EFS)&lt;/li&gt;
&lt;li&gt;Attach Lambda to VPC &amp;amp; Mount via Access Point&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Python Sample (Writing to EFS):
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open("/mnt/efs/mylog.txt", "a") as f:
    f.write("Function invoked at: {}\n".format(datetime.now()))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
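&lt;p&gt;A slightly fuller sketch of the same idea, with the mount path taken as a parameter so the logic can also be exercised outside Lambda (inside Lambda it would be the &lt;code&gt;/mnt/efs&lt;/code&gt; mount path):&lt;/p&gt;

```python
import os
from datetime import datetime

def append_and_count(mount_path="/mnt/efs", now=None):
    """Append one invocation line to the shared log, return total line count.

    mount_path defaults to the Lambda local_mount_path; pass a temp
    directory to exercise the logic outside Lambda.
    """
    log_file = os.path.join(mount_path, "mylog.txt")
    with open(log_file, "a") as f:
        f.write(f"Function invoked at: {now or datetime.now()}\n")
    with open(log_file) as f:
        return sum(1 for _ in f)

def lambda_handler(event, context):
    # The count keeps growing across invocations because EFS persists.
    return {"invocations_recorded": append_and_count()}
```

&lt;p&gt;Because the file lives on EFS rather than &lt;code&gt;/tmp&lt;/code&gt;, the count keeps climbing across cold starts and across concurrent execution environments.&lt;/p&gt;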

&lt;h2&gt;
  
  
  Real-World Use Case
&lt;/h2&gt;

&lt;p&gt;Let’s say you have a &lt;strong&gt;startup running ML inference via Lambda&lt;/strong&gt;. The ML model file is 400MB. Storing it in &lt;code&gt;/tmp&lt;/code&gt; (ephemeral, 512MB by default) or re-downloading it from S3 (slower, not POSIX-compliant) on every cold start doesn’t scale. With EFS mounted, your Lambda loads the model directly from the shared file system, cutting cold-start download time and improving inference speed.&lt;/p&gt;
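&lt;p&gt;That caching pattern can be sketched as a small helper. This is an illustrative sketch, not code from the article: the &lt;code&gt;fetch&lt;/code&gt; callback is a hypothetical stand-in for, say, a boto3 S3 download:&lt;/p&gt;

```python
import os

def ensure_model(model_name, fetch, mount_path="/mnt/efs"):
    """Return the path of a model file on EFS, fetching it only once.

    fetch(dest_path) is a caller-supplied callback (e.g. an S3 download);
    warm and concurrent invocations then reuse the cached copy.
    """
    dest = os.path.join(mount_path, model_name)
    if not os.path.exists(dest):
        tmp = dest + ".tmp"
        fetch(tmp)             # expensive download happens only once
        os.replace(tmp, dest)  # atomic rename so readers never see a partial file
    return dest
```

&lt;p&gt;The write-to-temp-then-rename step matters on a shared filesystem: another Lambda instance either sees no file (and fetches) or the complete file, never a half-written one.&lt;/p&gt;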
&lt;h2&gt;
  
  
  When Should You Use Lambda with EFS?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Should Use EFS?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ML model loading&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image/Video rendering (FFmpeg, PIL)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data sharing across Lambdas&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large file access during function&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small, stateless functions&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency DB-style caching&lt;/td&gt;
&lt;td&gt;Prefer ElastiCache or DynamoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Feature Comparison: EFS vs S3 vs /tmp vs ElastiCache
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;EFS&lt;/th&gt;
&lt;th&gt;S3&lt;/th&gt;
&lt;th&gt;/tmp&lt;/th&gt;
&lt;th&gt;ElastiCache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POSIX-Compliant&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max File Size&lt;/td&gt;
&lt;td&gt;~47.9TiB per file&lt;/td&gt;
&lt;td&gt;5TB per object&lt;/td&gt;
&lt;td&gt;512MB (default)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing Model&lt;/td&gt;
&lt;td&gt;Per GB/month + I/O&lt;/td&gt;
&lt;td&gt;Per request + storage&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Per node/hour&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/efs/pricing/" rel="noopener noreferrer"&gt;EFS Pricing&lt;/a&gt;, &lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;S3 Pricing&lt;/a&gt;, &lt;a href="https://aws.amazon.com/elasticache/pricing/" rel="noopener noreferrer"&gt;ElastiCache Pricing&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Terraform Script to Mount EFS with Lambda
&lt;/h2&gt;

&lt;p&gt;Here’s a representative snippet (replace the placeholder subnet IDs with your own before applying):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a VPC (or use an existing one)&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# EFS File System&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_file_system"&lt;/span&gt; &lt;span class="s2"&gt;"lambda_efs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;performance_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"generalPurpose"&lt;/span&gt;
  &lt;span class="nx"&gt;lifecycle_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;transition_to_ia&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AFTER_7_DAYS"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;throughput_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"bursting"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create Mount Targets in each AZ subnet&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_mount_target"&lt;/span&gt; &lt;span class="s2"&gt;"efs_mt"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;"subnet-az1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"subnet-az2"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with real subnet IDs&lt;/span&gt;
  &lt;span class="nx"&gt;file_system_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_efs_file_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_efs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;efs_sg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Security Group for EFS (Allow NFS)&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"efs_sg"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2049&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2049&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Lambda Function Role&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"lambda_exec"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-exec-role"&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda.amazonaws.com"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Lambda Function&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lambda_function"&lt;/span&gt; &lt;span class="s2"&gt;"efs_lambda"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;function_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"efs_lambda_demo"&lt;/span&gt;
  &lt;span class="nx"&gt;runtime&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"python3.9"&lt;/span&gt;
  &lt;span class="nx"&gt;handler&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"index.handler"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_exec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;filename&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-deploy.zip"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"subnet-az1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"subnet-az2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;efs_sg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;file_system_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;arn&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_efs_access_point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
    &lt;span class="nx"&gt;local_mount_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/mnt/efs"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# EFS Access Point&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_access_point"&lt;/span&gt; &lt;span class="s2"&gt;"ap"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;file_system_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_efs_file_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_efs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;posix_user&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;uid&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="nx"&gt;gid&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;root_directory&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/lambda"&lt;/span&gt;
    &lt;span class="nx"&gt;creation_info&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;owner_gid&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
      &lt;span class="nx"&gt;owner_uid&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
      &lt;span class="nx"&gt;permissions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"755"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation of Key Sections
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aws_efs_file_system&lt;/code&gt;: Creates the shared storage.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_efs_mount_target&lt;/code&gt;: Mounts the EFS to subnets in your VPC (required).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_efs_access_point&lt;/code&gt;: Simplifies access, ensures POSIX permissions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_lambda_function&lt;/code&gt;: The Lambda is now “wired” to use &lt;code&gt;/mnt/efs&lt;/code&gt; as a real folder.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vpc_config&lt;/code&gt;: Required, since Lambda with EFS &lt;strong&gt;must run inside a VPC&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Output &amp;amp; Testing
&lt;/h2&gt;

&lt;p&gt;If done correctly, your Lambda will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write/read files to &lt;code&gt;/mnt/efs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Persist across invocations&lt;/li&gt;
&lt;li&gt;Share data with other compute instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test tip: Add &lt;code&gt;import os&lt;/code&gt; and &lt;code&gt;print(os.listdir("/mnt/efs"))&lt;/code&gt; to your handler to verify the mount works.&lt;/p&gt;
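&lt;p&gt;A minimal verification handler along those lines, with the mount path parameterized so the same function can be smoke-tested outside Lambda:&lt;/p&gt;

```python
import os

def verify_mount(mount_path="/mnt/efs"):
    """List the mounted directory; raises FileNotFoundError if the mount is missing."""
    entries = sorted(os.listdir(mount_path))
    print(entries)
    return {"mount": mount_path, "files": entries, "count": len(entries)}

def lambda_handler(event, context):
    return verify_mount()
```

&lt;p&gt;If the function errors instead of returning a listing, check the VPC config, security group rules on port 2049, and the Access Point path.&lt;/p&gt;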

&lt;h2&gt;
  
  
  Pre-Checks Before Using Lambda + EFS
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Checklist Item&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Must deploy Lambda &lt;strong&gt;inside a VPC&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Required for EFS access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ensure proper &lt;strong&gt;security groups&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;NFS (2049) must be open between Lambda and EFS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avoid &lt;strong&gt;cold start bloat&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Mounting adds latency; prefer provisioned concurrency for speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Set correct &lt;strong&gt;POSIX permissions&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Access Point must match your Lambda UID/GID&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When &lt;em&gt;Not&lt;/em&gt; to Use This Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If your function runs fine within &lt;code&gt;/tmp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you're doing low-latency key-value access → use &lt;strong&gt;DynamoDB or ElastiCache&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If you don’t want VPC complexity (adding NAT Gateway, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CloudFormation Snippet: Mounting EFS to Lambda
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: '2010-09-09'
Description: Attach Amazon EFS to AWS Lambda for shared persistent storage

Resources:

  MyVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: LambdaVPC

  MySubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref MyVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [ 0, !GetAZs "" ]
      MapPublicIpOnLaunch: true

  MySecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow Lambda access to EFS
      VpcId: !Ref MyVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 2049
          ToPort: 2049
          CidrIp: 0.0.0.0/0 # for public test only — restrict in prod

  MyEFS:
    Type: AWS::EFS::FileSystem
    Properties:
      Encrypted: true
      PerformanceMode: generalPurpose

  MyMountTarget:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref MyEFS
      SubnetId: !Ref MySubnet1
      SecurityGroups:
        - !Ref MySecurityGroup

  MyAccessPoint:
    Type: AWS::EFS::AccessPoint
    Properties:
      FileSystemId: !Ref MyEFS
      PosixUser:
        Uid: "1000"
        Gid: "1000"
      RootDirectory:
        CreationInfo:
          OwnerUid: "1000"
          OwnerGid: "1000"
          Permissions: "750"
        Path: "/lambda"

  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole
        - arn:aws:iam::aws:policy/AmazonElasticFileSystemClientReadWriteAccess

  MyLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: MyEFSLambda
      Runtime: python3.12
      Handler: index.lambda_handler
      Code:
        ZipFile: |
          import os
          def lambda_handler(event, context):
              with open("/mnt/efs/test.txt", "w") as f:
                  f.write("Hello from Lambda using EFS!")
              return "File written to EFS"
      Role: !GetAtt LambdaExecutionRole.Arn
      VpcConfig:
        SubnetIds:
          - !Ref MySubnet1
        SecurityGroupIds:
          - !Ref MySecurityGroup
      FileSystemConfigs:
        - Arn: !GetAtt MyAccessPoint.Arn
          LocalMountPath: /mnt/efs

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explaining the Lambda + EFS CloudFormation Stack (Simplified Professional View)
&lt;/h2&gt;

&lt;p&gt;This CloudFormation setup provisions a &lt;strong&gt;serverless Lambda function connected to Amazon EFS&lt;/strong&gt;, enabling persistent, shared storage across invocations. Here's what each component does and why it matters:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Amazon EFS (&lt;code&gt;AWS::EFS::FileSystem&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provides a &lt;strong&gt;durable, shared NFS file system&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Ideal for &lt;strong&gt;ML models, media processing, or large binaries&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Mount Target&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enables EFS access from a &lt;strong&gt;Lambda in a VPC&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;One target per AZ is needed.&lt;/li&gt;
&lt;li&gt;Ensures &lt;strong&gt;private, secure VPC routing&lt;/strong&gt; to EFS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Access Point&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Defines a safe mount path (e.g., &lt;code&gt;/lambda&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Sets POSIX permissions to prevent access issues.&lt;/li&gt;
&lt;li&gt;AWS-recommended for &lt;strong&gt;multi-function access&lt;/strong&gt; and permission control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Security Groups&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allows &lt;strong&gt;port 2049 (NFS)&lt;/strong&gt; from Lambda to EFS.&lt;/li&gt;
&lt;li&gt;The EFS security group allows inbound NFS from the Lambda security group.&lt;/li&gt;
&lt;li&gt;Ensures &lt;strong&gt;secure, scoped connectivity&lt;/strong&gt; inside the VPC.&lt;/li&gt;
&lt;/ul&gt;
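&lt;p&gt;As a sketch, the pairing could look like this in CloudFormation (resource names such as &lt;code&gt;MyVPC&lt;/code&gt; and &lt;code&gt;EfsSecurityGroup&lt;/code&gt; are illustrative, not taken from the template above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # Attached to the Lambda function's ENIs
  MySecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Lambda-to-EFS client SG
      VpcId: !Ref MyVPC

  # Attached to the EFS mount target; admits NFS (2049) only from the Lambda SG
  EfsSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow NFS from Lambda
      VpcId: !Ref MyVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 2049
          ToPort: 2049
          SourceSecurityGroupId: !Ref MySecurityGroup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;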

&lt;h3&gt;
  
  
  5. &lt;strong&gt;IAM Role&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Grants &lt;strong&gt;VPC, EFS, and CloudWatch Logs&lt;/strong&gt; access.&lt;/li&gt;
&lt;li&gt;Follows least privilege: grant only the EFS client actions the function needs (e.g., &lt;code&gt;elasticfilesystem:ClientMount&lt;/code&gt;, &lt;code&gt;elasticfilesystem:ClientWrite&lt;/code&gt;) plus CloudWatch Logs write access.&lt;/li&gt;
&lt;/ul&gt;
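&lt;p&gt;If you prefer an inline least-privilege policy over the broad managed policies used in the template, it might look like this (a sketch; the account ID and file-system ID are placeholders to replace with your own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:ClientMount",
        "elasticfilesystem:ClientWrite"
      ],
      "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-EXAMPLE"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;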

&lt;h3&gt;
  
  
  6. &lt;strong&gt;Lambda Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Connects to private subnets + security group.&lt;/li&gt;
&lt;li&gt;Mounts EFS at &lt;code&gt;/mnt/efs&lt;/code&gt; using the Access Point.&lt;/li&gt;
&lt;li&gt;Can &lt;strong&gt;read/write files across invocations&lt;/strong&gt; — perfect for ML, media, or temp file sharing.&lt;/li&gt;
&lt;/ul&gt;
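&lt;p&gt;A minimal sketch of that read/write pattern in plain Python (the helper and file name are illustrative; inside Lambda, &lt;code&gt;base_dir&lt;/code&gt; would be &lt;code&gt;/mnt/efs&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

def append_record(base_dir, name, line):
    # Append one line to a file on the shared mount, then return all
    # lines seen so far -- state accumulates across invocations because
    # EFS persists outside the Lambda execution environment.
    path = os.path.join(base_dir, name)
    with open(path, "a") as f:
        f.write(line + "\n")
    with open(path) as f:
        return f.read().splitlines()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;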

&lt;h3&gt;
  
  
  When to Use:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ML inference with large model files.&lt;/li&gt;
&lt;li&gt;Persistent shared data (e.g., thumbnails, binaries).&lt;/li&gt;
&lt;li&gt;Temporary storage exceeding &lt;code&gt;/tmp&lt;/code&gt;’s default 512 MB (configurable up to 10 GB).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-checks:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lambda must run in &lt;strong&gt;private subnets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Ensure &lt;strong&gt;port 2049 is open&lt;/strong&gt; between SGs.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Access Points&lt;/strong&gt; to avoid permission issues.&lt;/li&gt;
&lt;li&gt;Role must have correct &lt;strong&gt;EFS + VPC permissions&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
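&lt;p&gt;One cheap runtime pre-check, in plain Python (a sketch; call it at the top of the handler with &lt;code&gt;/mnt/efs&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

def efs_ready(mount_path):
    # True only if something is actually mounted at mount_path --
    # catches the case where the directory exists but the EFS
    # mount itself never happened.
    return os.path.ismount(mount_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;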

&lt;h2&gt;
  
  
  Recommendations &amp;amp; Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use Lambda + EFS?&lt;/th&gt;
&lt;th&gt;Why?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ML inference with &amp;gt;100MB models&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Faster load time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sharing temp files between functions&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Persistent access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple file logging&lt;/td&gt;
&lt;td&gt;Use CloudWatch&lt;/td&gt;
&lt;td&gt;Cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time caching&lt;/td&gt;
&lt;td&gt;Use ElastiCache&lt;/td&gt;
&lt;td&gt;Lower latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Concepts and Terminology
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless compute function that runs code in response to events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EFS (Elastic File System)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scalable, network-based file storage for AWS compute services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VPC (Virtual Private Cloud)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isolated network environment for deploying AWS resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access Point&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named mount path with identity and permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput Mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controls how EFS scales I/O (Bursting or Provisioned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;/tmp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lambda's ephemeral local storage (512 MB by default, configurable up to 10 GB), cleared when the execution environment is recycled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElastiCache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-memory key-value store (e.g., Redis) used for caching/shared data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latency on first invocation due to provisioning resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-filesystem.html" rel="noopener noreferrer"&gt;AWS Lambda EFS Official Guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://aws.amazon.com/efs/pricing/" rel="noopener noreferrer"&gt;Amazon EFS Pricing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;Amazon S3 Pricing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://aws.amazon.com/elasticache/pricing/" rel="noopener noreferrer"&gt;ElastiCache Pricing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/lambda/aws/latest" rel="noopener noreferrer"&gt;AWS Terraform EFS + Lambda Example&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.aws.amazon.com/efs/latest/ug/performance.html" rel="noopener noreferrer"&gt;EFS Performance Tips&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.aws.amazon.com/efs/latest/ug/efs-access-points.html" rel="noopener noreferrer"&gt;Lambda + EFS Access Point Best Practices&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloudcomputing</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Skip OS Shutdown on EC2: Instantly Stop or Terminate Instances with AWS CLI v2.15+</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Sun, 27 Jul 2025 14:56:47 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/skip-os-shutdown-on-ec2-instantly-stop-or-terminate-instances-with-aws-cli-v215-2po9</link>
      <guid>https://dev.to/ismailkovvuru/skip-os-shutdown-on-ec2-instantly-stop-or-terminate-instances-with-aws-cli-v215-2po9</guid>
      <description>&lt;p&gt;In 2025, AWS introduced a powerful feature for Amazon EC2 that allows you to &lt;strong&gt;skip the operating system shutdown process&lt;/strong&gt; during &lt;code&gt;stop&lt;/code&gt; or &lt;code&gt;terminate&lt;/code&gt; operations. Using the &lt;code&gt;--skip-os-shutdown&lt;/code&gt; flag, you can immediately shut down or terminate an EC2 instance—without waiting for in-OS cleanup scripts, disk flushes, or graceful exits.&lt;/p&gt;

&lt;p&gt;This flag is a game-changer for &lt;strong&gt;DevOps pipelines, failover automation, blue-green deployments&lt;/strong&gt;, and &lt;strong&gt;ephemeral test environments&lt;/strong&gt; where &lt;strong&gt;speed takes priority over control&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is &lt;code&gt;--skip-os-shutdown&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;By default, EC2 sends a signal to the guest operating system to gracefully shut down when you stop or terminate an instance. This allows the OS to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run shutdown scripts&lt;/li&gt;
&lt;li&gt;Flush memory to disk&lt;/li&gt;
&lt;li&gt;Notify monitoring agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;code&gt;--skip-os-shutdown&lt;/code&gt;, this signal is &lt;strong&gt;bypassed&lt;/strong&gt;. The instance is instantly powered off or terminated, just like yanking the power cord from a physical server.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI Example:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 stop-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; i-1234567890abcdef0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--skip-os-shutdown&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Applies To:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;stop-instances&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;terminate-instances&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Available via:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI (v2.15+)&lt;/li&gt;
&lt;li&gt;AWS Console&lt;/li&gt;
&lt;li&gt;SDKs (progressively being updated)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisite: AWS CLI v2.15 or Later
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;--skip-os-shutdown&lt;/code&gt; flag is only supported in &lt;strong&gt;AWS CLI version 2.15.0 and above&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Check Your Version:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# Should return: aws-cli/2.15.0 or newer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to Upgrade:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  macOS/Linux:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./aws/install &lt;span class="nt"&gt;--update&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Windows:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Download: &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html" rel="noopener noreferrer"&gt;AWS CLI v2 MSI Installer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Use Cases for This Flag
&lt;/h2&gt;

&lt;p&gt;This feature is ideal when &lt;strong&gt;fast instance termination or shutdown&lt;/strong&gt; is required and you’re okay with skipping cleanup steps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Why It’s Useful&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Availability (HA)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rapidly remove and replace unhealthy EC2s during failover.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Blue-Green Deployments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quickly decommission old environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Test Runners&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instantly clean up short-lived EC2s after test jobs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spot Instances&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid delays in auto-replacement workflows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chaos Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Force fail nodes to test system resilience.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Situations &amp;amp; Tools Where You &lt;em&gt;Should Not&lt;/em&gt; Use It
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;--skip-os-shutdown&lt;/code&gt; bypasses critical OS-level processes. Here’s a breakdown of where this could cause &lt;strong&gt;problems&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Stateful Applications / Databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Why Not&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MySQL, PostgreSQL, MongoDB, Redis&lt;/td&gt;
&lt;td&gt;May lose in-memory or unflushed data; corrupt journals.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch, Kafka&lt;/td&gt;
&lt;td&gt;Disrupts cluster state or causes shard inconsistency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;tmpfs&lt;/code&gt; / RAM-backed processes&lt;/td&gt;
&lt;td&gt;Data is lost immediately.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;EC2 Lifecycle Tools &amp;amp; Shutdown Hooks&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Auto Scaling Lifecycle Hooks&lt;/td&gt;
&lt;td&gt;Terminating hook (&lt;code&gt;EC2_INSTANCE_TERMINATING&lt;/code&gt;) may never trigger.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpsWorks / Elastic Beanstalk&lt;/td&gt;
&lt;td&gt;Skips teardown, logs, and state tracking.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom AMIs with shutdown scripts&lt;/td&gt;
&lt;td&gt;Scripts for cleanup or final logging won’t run.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;CI/CD Agents &amp;amp; Test Frameworks&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jenkins EC2 Agents&lt;/td&gt;
&lt;td&gt;Results/logs not archived, jobs may break.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions (self-hosted)&lt;/td&gt;
&lt;td&gt;Workspace cleanup skipped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeDeploy&lt;/td&gt;
&lt;td&gt;Lifecycle events like &lt;code&gt;BeforeBlockTraffic&lt;/code&gt; skipped.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Monitoring &amp;amp; Security Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Agent, Datadog, New Relic&lt;/td&gt;
&lt;td&gt;Final logs/metrics may not be sent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GuardDuty, OSQuery, Falco&lt;/td&gt;
&lt;td&gt;Missed signals, incomplete audits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SOC2 / ISO-certified environments&lt;/td&gt;
&lt;td&gt;Could breach audit policies requiring graceful shutdowns.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;EC2 Features Requiring Shutdown&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EC2 Hibernate&lt;/td&gt;
&lt;td&gt;Hibernate state won’t be saved.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMI Creation&lt;/td&gt;
&lt;td&gt;Image may be inconsistent or dirty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Alarms&lt;/td&gt;
&lt;td&gt;May falsely trigger due to skipped signal.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto Recovery&lt;/td&gt;
&lt;td&gt;May misinterpret health check failures.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Behind the Scenes: What Happens Internally?
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;--skip-os-shutdown&lt;/code&gt; is used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No ACPI signal&lt;/strong&gt; is sent to the OS&lt;/li&gt;
&lt;li&gt;AWS forcibly stops the instance at the hypervisor level&lt;/li&gt;
&lt;li&gt;RAM is purged&lt;/li&gt;
&lt;li&gt;OS cleanup or shutdown logic is entirely bypassed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s essentially a &lt;strong&gt;hard power-off&lt;/strong&gt;, not a shutdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  EBS Volume Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Attached EBS volumes &lt;strong&gt;remain intact&lt;/strong&gt;, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications with delayed writes may leave &lt;strong&gt;incomplete data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;File systems not mounted with &lt;code&gt;sync&lt;/code&gt; or not journaled may be &lt;strong&gt;inconsistent&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Use &lt;code&gt;sync&lt;/code&gt;, &lt;code&gt;fsync()&lt;/code&gt;, or journaling file systems to minimize risk.&lt;/p&gt;
&lt;/blockquote&gt;
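&lt;p&gt;As a concrete sketch (plain Python, no AWS dependency): flushing and &lt;code&gt;fsync&lt;/code&gt;-ing before you consider a write complete means even a hard power-off cannot lose the buffered bytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

def durable_write(path, data):
    # Write data, then flush Python's buffer and ask the kernel to
    # push it to stable storage -- after fsync returns, a hard
    # power-off (like --skip-os-shutdown) cannot lose these bytes.
    with open(path, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return os.path.getsize(path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;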

&lt;h2&gt;
  
  
  Monitoring Caveats
&lt;/h2&gt;

&lt;p&gt;Skipping shutdown can confuse your observability stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Metrics&lt;/td&gt;
&lt;td&gt;May report inaccurate CPU/memory usage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;Final flush of metrics/logs skipped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;td&gt;Node exporter may not unregister cleanly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;:&lt;br&gt;
Use &lt;strong&gt;EventBridge rules&lt;/strong&gt; to trigger compensating actions after termination.&lt;/p&gt;
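&lt;p&gt;A sketch of such a rule’s event pattern: it matches EC2 state-change events for terminated instances, so a cleanup Lambda can run the steps the skipped shutdown never did:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;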

&lt;h2&gt;
  
  
  Advanced Workflow Example: Blue-Green Deployment
&lt;/h2&gt;

&lt;p&gt;Here’s how &lt;code&gt;--skip-os-shutdown&lt;/code&gt; fits into a zero-downtime deploy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy new version (Green) → health check passes&lt;/li&gt;
&lt;li&gt;Route traffic to Green&lt;/li&gt;
&lt;li&gt;Drain and disable Blue&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;--skip-os-shutdown&lt;/code&gt; to instantly remove Blue&lt;/li&gt;
&lt;li&gt;Trigger cleanup Lambda via CloudWatch/EventBridge&lt;/li&gt;
&lt;li&gt;Free up EBS/ENI/IP and complete deploy&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Summary Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Spot instances, failover systems, fast teardown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avoid In&lt;/td&gt;
&lt;td&gt;Databases, CI/CD runners, audit-compliant systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What’s Skipped&lt;/td&gt;
&lt;td&gt;Shutdown scripts, disk flush, monitoring agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI Requirement&lt;/td&gt;
&lt;td&gt;AWS CLI v2.15.0+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EBS Risk&lt;/td&gt;
&lt;td&gt;Data may be inconsistent without flushing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Not For&lt;/td&gt;
&lt;td&gt;Hibernate, AMI creation, critical shutdown processes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--skip-os-shutdown&lt;/code&gt; is a powerful flag that prioritizes &lt;strong&gt;speed over safety&lt;/strong&gt;. Use it in &lt;strong&gt;automated, stateless environments&lt;/strong&gt;, but &lt;strong&gt;avoid it anywhere state, compliance, or graceful teardown matters&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of it as a tool in your belt—not a default behavior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_StopInstances.html" rel="noopener noreferrer"&gt;AWS EC2 StopInstances API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/hashicorp/terraform-provider-aws/issues" rel="noopener noreferrer"&gt;Terraform GitHub Discussion for Support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related Blogs:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://redsignals.beehiiv.com/p/mastering-amazon-eks-upgrades-the-ultimate-senior-level-guide?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=666dbd6e5f48537e68313a2ef75d8a064f84f565" rel="noopener noreferrer"&gt;Mastering Amazon EKS Upgrades: The Ultimate Senior-Level Guide&lt;/a&gt;
2.&lt;a href="https://redsignals.beehiiv.com/p/crashloopbackoff-with-no-logs-fix-guide-for-kubernetes-with-yaml-ci-cd?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=e2cb5681c30fb120c494b9ca1f280e394379a46b" rel="noopener noreferrer"&gt; CrashLoopBackOff with No Logs - Fix Guide for Kubernetes with YAML &amp;amp; CI/CD &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redsignals.beehiiv.com/p/multi-tenancy-in-amazon-eks-secure-scalable-kubernetes-isolation-with-quotas-observability-dr?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=99ef931134046c40d7dc6908b21af6110545d48a" rel="noopener noreferrer"&gt;Multi-Tenancy in Amazon EKS: Secure, Scalable Kubernetes Isolation with Quotas, Observability &amp;amp; DR &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dubniumlabs.blogspot.com/2025/07/10-proven-kubectl-commands-ultimate.html?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=a2fb8ef1ea88c0f1c355e674f2e9ca2425abcbdd" rel="noopener noreferrer"&gt;10 Proven kubectl Commands: The Ultimate 2025 AWS Kubernetes Guide&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/ismailkovvuru/one-container-per-pod-kubernetes-done-right-g5c?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=39941c1c8e3dc29b0588d6f7c84df2af4f4f0c8a"&gt;One Container per Pod: Kubernetes Done Right&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redsignals.beehiiv.com/p/why-kubernetes-cluster-autoscaler-fails-fixes-logs-yaml-inside?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=48e46711b877915601fc571417d7ce99571e9b7d" rel="noopener noreferrer"&gt;Why Kubernetes Cluster Autoscaler Fails ? Fixes, Logs &amp;amp; YAML Inside&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ismailkovvuru.hashnode.dev/ansible-inventory-guide-2025-choosing-between-static-and-dynamic-for-aws?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=7fac456122bf3a9171eb94c0c58b815233558f5c" rel="noopener noreferrer"&gt;Ansible Inventory Guide 2025&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://ismailkovvuru.hashnode.dev/devops-without-observability-is-a-disaster-waiting-to-happen?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=b1110d825afdab4cd940fe6f35a3ba42e14d2679" rel="noopener noreferrer"&gt;DevOps without Observability&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For more topics visit &lt;a href="https://medium.com/@ismailkovvuru" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; , &lt;a href="https://redsignals.beehiiv.com/subscribe" rel="noopener noreferrer"&gt;Red Signals&lt;/a&gt; and &lt;a href="https://dubniumlabs.blogspot.com/?utm_source=redsignals.beehiiv.com&amp;amp;utm_medium=newsletter&amp;amp;utm_campaign=kubelet-restart-in-aws-eks-causes-logs-fixes-node-stability-guide-2025&amp;amp;_bhlid=f52a3b48146b6d92fd7365510ef46fb42486d7f0" rel="noopener noreferrer"&gt;Dubniumlabs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>discuss</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AWS EC2 Placement Groups Explained (2025): High Availability, Cluster, Spread &amp; Partition with Real-World Automation</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Fri, 18 Jul 2025 06:57:14 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-ec2-placement-groups-explained-2025-high-availability-cluster-spread-partition-with-4b0b</link>
      <guid>https://dev.to/ismailkovvuru/aws-ec2-placement-groups-explained-2025-high-availability-cluster-spread-partition-with-4b0b</guid>
      <description>&lt;p&gt;Learn to master AWS EC2 Placement Groups in 2025: Cluster, Spread, and Partition. When to use them, real scenarios, Terraform, CloudFormation &amp;amp; CLI scripts, capacity pitfalls, practical HA design, and troubleshooting.&lt;/p&gt;

&lt;p&gt;Many AWS engineers never touch Placement Groups — and then wonder why their HPC jobs fail to launch or why a single rack outage kills their entire Kafka cluster.&lt;/p&gt;

&lt;p&gt;EC2 Placement Groups are one of AWS’s least understood yet most powerful tools for designing true rack-level High Availability, low-latency HPC, and fault-domain isolation at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an EC2 Placement Group?
&lt;/h2&gt;

&lt;p&gt;By default, when you launch EC2 instances, AWS places them wherever its capacity and resiliency needs are best served.&lt;/p&gt;

&lt;p&gt;A Placement Group (PG) lets you override that default — so you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pack nodes together for max speed&lt;/li&gt;
&lt;li&gt;Spread them apart for fault isolation&lt;/li&gt;
&lt;li&gt;Split them into clear failure domains&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The 3 Placement Group Types
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Type          | Best For                   | How It Works                                                     | Limits                                             | When to Use                                                      |
| ------------- | -------------------------- | ---------------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------- |
|   Cluster     | HPC, ML, GPU               | Puts all nodes on same rack / rack set                           | Single AZ, practical limit is available hardware   | When you need ultra-low latency, high throughput                 |
|   Spread      | Small HA nodes             | Guarantees each node is on separate rack                         | 7 per AZ, can span AZs                             | When 3–7 nodes must survive single rack failure                  |
|   Partition   | Large distributed clusters | Divides nodes into partitions → racks → explicit failure domains | Up to 7 partitions per AZ, each can hold many EC2s | When you need explicit rack-level fault domains for big clusters |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Cluster Placement Group — What It Is?&lt;/strong&gt;&lt;br&gt;
A Cluster Placement Group tries to pack your EC2 instances as close together as possible, usually in the same rack, to minimize latency and maximize throughput.&lt;/p&gt;

&lt;p&gt;Used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HPC jobs&lt;/li&gt;
&lt;li&gt;Distributed ML training&lt;/li&gt;
&lt;li&gt;Big data pipelines needing tight east-west traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why not always use it?&lt;br&gt;
If the rack runs out of slots, your launch fails — so you must design for capacity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always pair Cluster PG with Capacity Reservations for predictable launches.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+
|      Cluster PG (AZ-a)      |
+-----------------------------+
|                             |
| [ Rack A ]                  |
| -------------------------   |
| EC2-1   EC2-2   EC2-3       |
| EC2-4   EC2-5   EC2-6       |
|                             |
+-----------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Spread Placement Group — What It Is&lt;/strong&gt;&lt;br&gt;
A Spread Placement Group places each EC2 on a completely separate rack with separate hardware, power, and network.&lt;/p&gt;

&lt;p&gt;Used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small sets of must-not-fail nodes&lt;/li&gt;
&lt;li&gt;HA quorum services (web servers, payment front ends, DNS)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Limit: Only 7 per AZ — so multi-AZ Spread PGs are common.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------+
|               Spread PG (AZ-a)                |
+-----------------------------------------------+
|                                               |
| [ Rack A ]   EC2-1                            |
| [ Rack B ]   EC2-2                            |
| [ Rack C ]   EC2-3                            |
|                                               |
+-----------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Partition Placement Group — What It Is&lt;/strong&gt;&lt;br&gt;
A Partition PG splits your cluster into partitions → mapped to rack sets → each acts as a fault domain.&lt;/p&gt;

&lt;p&gt;Used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large clusters like Kafka, Cassandra, Hadoop&lt;/li&gt;
&lt;li&gt;If Rack A fails, only Partition 1 is affected → cluster keeps running
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+
|  Partition Placement Group  |
+-----------------------------+
|                             |
| Partition 1 → Rack A        |
| -------------------------   |
| EC2-1   EC2-2   EC2-3       |
|                             |
| Partition 2 → Rack B        |
| -------------------------   |
| EC2-4   EC2-5   EC2-6       |
|                             |
| Partition 3 → Rack C        |
| -------------------------   |
| EC2-7   EC2-8   EC2-9       |
+-----------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
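&lt;p&gt;In CloudFormation, a partition PG like the one sketched above can be declared as follows (a sketch; the resource name is illustrative, and &lt;code&gt;PartitionCount&lt;/code&gt; sets the number of fault domains):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  KafkaPlacementGroup:
    Type: AWS::EC2::PlacementGroup
    Properties:
      Strategy: partition
      PartitionCount: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;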

&lt;h2&gt;
  
  
  How Many EC2s? — Actual Limits
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Type          | Max per AZ                                     | Multi-AZ?                     | Notes                               |
| ------------- | ---------------------------------------------- | ----------------------------- | ----------------------------------- |
| 1.Cluster     | No strict limit, practical capacity limit only |   One AZ only                 | Constrained by available rack slots |
| 2.Spread      | 7 per AZ                                       |   Create one Spread PG per AZ | Each instance on different rack     |
| 3.Partition   | 7 partitions per AZ (each can hold 100s)       |   One AZ only                 | Must replicate across AZs yourself  |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
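&lt;p&gt;You can verify what the table above describes in your own account by listing existing placement groups (the &lt;code&gt;--query&lt;/code&gt; projection is just one way to format the output):&lt;/p&gt;

```shell
# List placement groups with their strategy and state
aws ec2 describe-placement-groups \
  --query 'PlacementGroups[].{Name:GroupName,Strategy:Strategy,State:State}' \
  --output table
```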

&lt;h2&gt;
  
  
  Why These Limits Exist
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Type          | Limit               | Why?                            | If Exceeded                          |
| ------------- | ------------------- | ------------------------------- | ------------------------------------ |
| 1.Cluster     | Hardware capacity   | All nodes must fit on same rack | `InsufficientInstanceCapacity` error |
| 2.Spread      | 7 per AZ            | AWS guarantees unique racks     | 8th instance won’t launch            |
| 3.Partition   | 7 partitions per AZ | 7 unique rack sets per AZ       | Error if you request 8+ partitions   |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;These limits come from real hardware constraints, not arbitrary quotas, so you must design around them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Real Error Snippet
&lt;/h2&gt;

&lt;p&gt;Trying to launch a large Cluster PG in an AZ without spare capacity? You’ll see:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation:
We currently do not have sufficient m5.4xlarge capacity in the Availability Zone you requested.
Our system will be working on provisioning additional capacity.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;How to fix:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retry later&lt;/li&gt;
&lt;li&gt;Switch to another AZ&lt;/li&gt;
&lt;li&gt;Use a Capacity Reservation:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 create-capacity-reservation \
  --instance-type m5.4xlarge \
  --instance-count 4 \
  --availability-zone us-east-1a

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
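&lt;p&gt;Once the reservation exists, you can target it explicitly at launch. A sketch, where &lt;code&gt;cr-xxxxxxxx&lt;/code&gt; stands in for the reservation ID returned by the previous command:&lt;/p&gt;

```shell
# Launch into the capacity you reserved (replace the placeholder IDs)
aws ec2 run-instances \
  --image-id ami-xxxxxxx \
  --instance-type m5.4xlarge \
  --count 4 \
  --capacity-reservation-specification \
    "CapacityReservationTarget={CapacityReservationId=cr-xxxxxxxx}"
```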

&lt;h2&gt;
  
  
  Terraform, CLI &amp;amp; CloudFormation — Explained
&lt;/h2&gt;

&lt;p&gt;Terraform — Spread PG&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define a Placement Group resource in AWS using Terraform
resource "aws_placement_group" "spread_pg" {
  name     = "my-spread-pg"   # Human-friendly name for this PG
  strategy = "spread"         # Placement strategy: Spread means each instance on a separate rack
}

# Launch an EC2 instance attached to the Spread Placement Group
resource "aws_instance" "web_node" {
  ami             = "ami-xxxxxxx"                 # Replace with your AMI ID (e.g., Amazon Linux)
  instance_type   = "t3.micro"                    # Choose your desired instance type
  placement_group = aws_placement_group.spread_pg.name  # Attach to the defined Spread PG
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;How to use&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save as main.tf&lt;/li&gt;
&lt;li&gt;Run:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init   # Initialize Terraform
terraform plan   # See what will be created
terraform apply  # Create the PG and launch your instance

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;AWS CLI — Spread PG&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a Spread Placement Group named "my-spread-pg"
aws ec2 create-placement-group \
  --group-name my-spread-pg \
  --strategy spread

# Launch an EC2 instance into the Spread Placement Group
# Replace ami-xxxxxxx with your AMI ID and choose your instance type
# (inline comments after a line-continuation backslash would break the command)
aws ec2 run-instances \
  --image-id ami-xxxxxxx \
  --instance-type t3.micro \
  --placement GroupName=my-spread-pg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;How to use&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replace ami-xxxxxxx with a real AMI ID (e.g., Amazon Linux 2).&lt;/li&gt;
&lt;li&gt;Run each command in your terminal (you must be authenticated, e.g., via &lt;code&gt;aws configure&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CloudFormation — Spread PG&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Resources:
  # Define the Placement Group with Spread strategy
  MyPlacementGroup:
    Type: AWS::EC2::PlacementGroup
    Properties:
      Strategy: spread

  # Launch an EC2 instance inside the Placement Group
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-xxxxxxx      # Replace with your AMI ID
      InstanceType: t3.micro    # Your desired instance type
      PlacementGroupName: !Ref MyPlacementGroup  # Reference the defined Spread PG

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;How to use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save as spread-pg.yaml&lt;/li&gt;
&lt;li&gt;Deploy with:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation create-stack \
  --stack-name my-spread-stack \
  --template-body file://spread-pg.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Tips&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always double-check your AMI ID — wrong region = launch failure.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;terraform destroy&lt;/code&gt; to clean up your test infra.&lt;/li&gt;
&lt;li&gt;Tag your Placement Groups (&lt;code&gt;tags&lt;/code&gt; block in Terraform) for cost tracking.&lt;/li&gt;
&lt;li&gt;Combine with an Auto Scaling Group if you need elasticity with rack-level isolation.&lt;/li&gt;
&lt;/ol&gt;
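&lt;p&gt;Tip 3 can also be applied at creation time with the CLI (tag keys and values here are illustrative):&lt;/p&gt;

```shell
# Create a Spread PG with cost-tracking tags attached at creation
aws ec2 create-placement-group \
  --group-name my-spread-pg \
  --strategy spread \
  --tag-specifications 'ResourceType=placement-group,Tags=[{Key=team,Value=platform},{Key=env,Value=test}]'
```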
&lt;h2&gt;
  
  
  AWS Terms Cheat Sheet
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Term                 | What It Means                  |
| -------------------- | ------------------------------ |
| Placement Group      | Logical rack placement control |
| Cluster PG           | Same rack                      |
| Spread PG            | Separate racks                 |
| Partition PG         | Fault-domain racks             |
| Capacity Reservation | Guarantees physical slots      |
| AZ                   | Availability Zone              |
| ASG                  | Auto Scaling Group             |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Updated — Multi-AZ Placement Groups (2025)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cluster PG:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Not multi-AZ capable.&lt;/strong&gt; A &lt;strong&gt;Cluster Placement Group&lt;/strong&gt; must keep all instances physically close on the &lt;em&gt;same rack or rack set&lt;/em&gt; for ultra-low latency — this is only possible &lt;strong&gt;inside a single Availability Zone&lt;/strong&gt;. Cross-AZ networking would destroy its main benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Partition PG:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Not multi-AZ capable.&lt;/strong&gt; A &lt;strong&gt;Partition Placement Group&lt;/strong&gt; logically maps each partition to a unique rack set within &lt;strong&gt;one AZ&lt;/strong&gt; to create explicit failure domains. AWS does not support Partition PGs that span multiple AZs. To achieve multi-AZ fault tolerance, deploy &lt;strong&gt;separate Partition PGs in each AZ&lt;/strong&gt; and handle replication at the application layer (e.g., Kafka multi-AZ topic replication).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Spread PG:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Multi-AZ capable.&lt;/strong&gt; Unlike Cluster or Partition, Spread placement works across &lt;strong&gt;multiple AZs&lt;/strong&gt;: a rack-level Spread PG can span AZs in the same Region (with a limit of seven running instances per AZ per group), or you can create &lt;strong&gt;one Spread PG per AZ&lt;/strong&gt;. Either way, instances are guaranteed separate racks within their AZ. Spreading a small, critical node set across AZs this way gives you both &lt;strong&gt;rack-level failure isolation&lt;/strong&gt; and &lt;strong&gt;AZ-level disaster recovery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; When using Spread PG for multi-AZ HA, plan separate subnets, separate Spread PGs, and properly configure cross-AZ load balancing (e.g., ALB or NLB) to keep traffic healthy if a rack or AZ fails.&lt;/p&gt;
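&lt;p&gt;A minimal sketch of the per-AZ pattern (group names, AMI, and subnet IDs are illustrative; repeat the launch for each AZ’s subnet and group):&lt;/p&gt;

```shell
# One Spread PG per AZ
aws ec2 create-placement-group --group-name spread-us-east-1a --strategy spread
aws ec2 create-placement-group --group-name spread-us-east-1b --strategy spread

# Launch a critical node into one AZ via its subnet and Spread PG
aws ec2 run-instances \
  --image-id ami-xxxxxxx \
  --instance-type t3.micro \
  --subnet-id subnet-aaaa1111 \
  --placement GroupName=spread-us-east-1a
```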
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Designing high-performance, fault-tolerant EC2 workloads is far more than just choosing the right instance type — it’s also about controlling where your instances run physically.&lt;/p&gt;

&lt;p&gt;Placement Groups (Cluster, Spread, and Partition) are powerful tools for advanced engineers who truly need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ultra-low network latency (Cluster)&lt;/li&gt;
&lt;li&gt;Rack-level fault tolerance for critical nodes (Spread)&lt;/li&gt;
&lt;li&gt;Explicit fault domains for large clusters (Partition)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Are they recommended for every workload?&lt;/strong&gt;&lt;br&gt;
No. For most small or generic auto-scaling workloads, AWS’s default placement is good enough — using Placement Groups when you don’t need them can add unnecessary complexity and failure points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When are they strongly recommended?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you run HPC, distributed GPU, or tight MPI jobs needing rack-level speed (Cluster)&lt;/li&gt;
&lt;li&gt;When you must guarantee that no single rack outage takes down all quorum nodes (Spread)&lt;/li&gt;
&lt;li&gt;When you design large, stateful clusters that need explicit failure domains for fast recovery (Partition)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What to keep in mind before using them:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always verify AZ capacity. Cluster PGs commonly hit &lt;code&gt;InsufficientInstanceCapacity&lt;/code&gt; if you request large node counts.&lt;/li&gt;
&lt;li&gt;Combine Cluster PGs with Capacity Reservations for mission-critical HPC.&lt;/li&gt;
&lt;li&gt;Design Spread PGs for multi-AZ from day one — never assume one AZ’s 7-instance limit will be enough for future scale.&lt;/li&gt;
&lt;li&gt;Partition PGs do not magically replicate across AZs — you must handle multi-AZ replication at the app level.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Placement Groups work best when you deeply understand the physical infra behind AWS’s virtual promise. Used right, they unlock HA and throughput levels you can’t get from default placement alone. Used wrong, they cause launch failures or false sense of redundancy.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  References — AWS Official Docs
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html" rel="noopener noreferrer"&gt;AWS Placement Groups Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html" rel="noopener noreferrer"&gt;AWS EC2 Capacity Reservations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/hpc/" rel="noopener noreferrer"&gt;AWS HPC Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/reliability-pillar.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Reliability Pillar&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  You will find these topics useful
&lt;/h2&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://dubniumlabs.blogspot.com/2025/07/aws-bedrock-vs-sagemaker-jumpstart.html" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblogger.googleusercontent.com%2Fimg%2Fb%2FR29vZ2xl%2FAVvXsEhA7AgaRMGT9Xux8kvXSEkLaH2t6s1KgdlB8Q3yp_Ii3wX2e-PD0iRxHnwAYJxIWo-IZM-GNfcG2Ptz8Xv7qpLlsj2fpFC3zk0VC3ize2-yMYLUOqGJEJsNAW052RoHgnJBDuykuir8SRcKsYaL3EwrI7wLuyRd98SaXx-EGKN0tNUGQqvMzAALu-KCSQU%2Fw1200-h630-p-k-no-nu%2FAWS%2520Bedrock%2520vs%2520SageMaker.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://dubniumlabs.blogspot.com/2025/07/aws-bedrock-vs-sagemaker-jumpstart.html" rel="noopener noreferrer" class="c-link"&gt;
            AWS Bedrock vs SageMaker JumpStart: Which One to Use for Your GenAI Use Case?
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            AWS Bedrock or SageMaker JumpStart? Deep-dive comparison for GenAI projects. Use cases, costs, and performance insights explained clearly.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdubniumlabs.blogspot.com%2Ffavicon.ico"&gt;
          dubniumlabs.blogspot.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://dubniumlabs.blogspot.com/2025/07/10-proven-kubectl-commands-ultimate.html" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblogger.googleusercontent.com%2Fimg%2Fb%2FR29vZ2xl%2FAVvXsEi6QEkuFCmm4x_AUaGxT-TLvuwhYZlzsYlV9zcBiFjskpKEs6AMI7bZpPlPoCq0gW_TsjGbJMBg-fflcKVyCpIF4TdTXLzI2GX8FAPlK9Vqf07WDVMRSUmYvlYE1z53zyHcnUmjfXsFGct4ZjkMVIKCRmvOIh4ESYT_lSSeAV010fvSjOEjzwXTfG2tZO8%2Fw1200-h630-p-k-no-nu%2FThe%2520Ultimate%25202025%2520AWS%2520Kubernetes%2520Guide.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://dubniumlabs.blogspot.com/2025/07/10-proven-kubectl-commands-ultimate.html" rel="noopener noreferrer" class="c-link"&gt;
            10 Proven kubectl Commands: The Ultimate 2025 AWS Kubernetes Guide
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            10 proven kubectl commands with examples for AWS DevOps in 2025. Master Kubernetes clusters, pods, deployments, services, and ultimate hands-on.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdubniumlabs.blogspot.com%2Ffavicon.ico"&gt;
          dubniumlabs.blogspot.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>aws</category>
      <category>devops</category>
      <category>discuss</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Kubernetes Autoscaling Fails Silently. Here's Why and How to Fix It</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Sun, 06 Jul 2025 08:39:51 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/kubernetes-autoscaling-fails-silently-heres-why-and-how-to-fix-it-34h8</link>
      <guid>https://dev.to/ismailkovvuru/kubernetes-autoscaling-fails-silently-heres-why-and-how-to-fix-it-34h8</guid>
      <description>&lt;h2&gt;
  
  
  Why Isn’t Your Kubernetes Cluster Autoscaler Scaling?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your pods are pending. The nodes aren’t scaling. Logs say nothing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sound familiar? You’re not alone.&lt;/p&gt;

&lt;p&gt;Most Kubernetes engineers hit this silently frustrating issue. And the truth is, the autoscaler isn’t broken. It’s just &lt;strong&gt;misunderstood&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s a sneak peek into what’s really going on:&lt;/p&gt;

&lt;h3&gt;
  
  
  What You’ll Learn in This Breakdown:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Why Cluster Autoscaler &lt;strong&gt;ignores&lt;/strong&gt; some pods (even if they're pending)&lt;/li&gt;
&lt;li&gt;How &lt;code&gt;nodeSelector&lt;/code&gt;, &lt;code&gt;taints&lt;/code&gt;, and affinity rules silently block scaling&lt;/li&gt;
&lt;li&gt;What real Cluster Autoscaler logs actually mean&lt;/li&gt;
&lt;li&gt;The hidden impact of PDBs, PVC zones, and priority classes&lt;/li&gt;
&lt;li&gt;YAML: Before &amp;amp; After examples that fix scaling issues instantly&lt;/li&gt;
&lt;li&gt;Terraform ASG configs for autoscaler to work properly&lt;/li&gt;
&lt;li&gt;Observability patterns + self-healing strategies (Kyverno, alerts, CI/CD)&lt;/li&gt;
&lt;/ul&gt;
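&lt;p&gt;If you want to see those decisions for yourself before reading on, the Cluster Autoscaler logs its reasoning. A sketch, assuming the common &lt;code&gt;kube-system&lt;/code&gt; deployment name (adjust for your install):&lt;/p&gt;

```shell
# Tail the Cluster Autoscaler's own logs (deployment name varies by install)
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=100

# Filter for scale-up decisions, e.g. "pod didn't trigger scale-up"
kubectl -n kube-system logs deployment/cluster-autoscaler | grep -i 'scale-up'
```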

&lt;h3&gt;
  
  
  Here’s a Sample Fix:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bad Pod Spec (won’t scale):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;instance-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fixed YAML (scales properly):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250m"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app-node"&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Equal"&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Want the Full Breakdown?
&lt;/h3&gt;

&lt;p&gt;I’ve published the &lt;strong&gt;entire guide&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real autoscaler logs&lt;/li&gt;
&lt;li&gt;Terraform IAM &amp;amp; ASG configs&lt;/li&gt;
&lt;li&gt;YAML validation checks&lt;/li&gt;
&lt;li&gt;Edge case scenarios no one talks about&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;Read the full post here on RedSignals:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://redsignals.beehiiv.com/p/why-kubernetes-cluster-autoscaler-fails-fixes-logs-yaml-inside" rel="noopener noreferrer"&gt;Why Kubernetes Cluster Autoscaler Fails — Fixes, Logs &amp;amp; YAML Inside&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>AWS Compute Wars 2025: EC2 vs Lambda vs Fargate vs EKS</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Wed, 25 Jun 2025 10:58:12 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/aws-compute-wars-2025-ec2-vs-lambda-vs-fargate-vs-eks-4ici</link>
      <guid>https://dev.to/ismailkovvuru/aws-compute-wars-2025-ec2-vs-lambda-vs-fargate-vs-eks-4ici</guid>
      <description>&lt;p&gt;In 2025, choosing the right AWS compute service isn't just about knowing the features — it's about making the &lt;em&gt;right decision&lt;/em&gt; for your workload, budget, and team.&lt;/p&gt;

&lt;p&gt;As an AWS Engineer or DevOps Architect, you’ve likely asked yourself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Should I go with Lambda or ECS?”&lt;br&gt;&lt;br&gt;
“Is EC2 still relevant in 2025?”&lt;br&gt;&lt;br&gt;
“Do we really need to adopt EKS?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wrote a detailed blog post exploring these questions — backed by real-world experience, architecture patterns, and cost-performance tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you’ll learn in the post:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When to choose EC2, Lambda, ECS, or EKS&lt;/li&gt;
&lt;li&gt;Cost vs complexity vs control breakdown&lt;/li&gt;
&lt;li&gt;Real use cases from cloud-native and hybrid teams&lt;/li&gt;
&lt;li&gt;An architecture decision matrix you can actually use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re designing new systems or modernizing legacy ones, this guide will help you make confident compute decisions in 2025 and beyond.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Read the full post on Medium:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://medium.com/@ismailkovvuru/aws-compute-services-in-2025-a-aws-engineers-guide-to-real-world-architecture-decisions-2f23f26524b2" rel="noopener noreferrer"&gt;https://medium.com/@ismailkovvuru/aws-compute-services-in-2025-a-aws-engineers-guide-to-real-world-architecture-decisions-2f23f26524b2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd love your thoughts:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Which AWS compute service are you using in 2025 — and why?&lt;/p&gt;

&lt;p&gt;Drop your stack or experience in the comments 👇&lt;br&gt;&lt;br&gt;
Let’s talk DevOps architecture!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
      <category>discuss</category>
    </item>
    <item>
      <title>One Container per Pod: Kubernetes Done Right</title>
      <dc:creator>Ismail Kovvuru</dc:creator>
      <pubDate>Sun, 22 Jun 2025 13:29:16 +0000</pubDate>
      <link>https://dev.to/ismailkovvuru/one-container-per-pod-kubernetes-done-right-g5c</link>
      <guid>https://dev.to/ismailkovvuru/one-container-per-pod-kubernetes-done-right-g5c</guid>
      <description>&lt;p&gt;Learn why running one container per pod is a Kubernetes best practice. Explore real-world fintech use cases, security benefits, and scaling advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Pod in Kubernetes?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt;&lt;br&gt;
A Pod is the smallest deployable unit in Kubernetes. It represents a single instance of a running process in your cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zl6hac3f3sfv494hzk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zl6hac3f3sfv494hzk9.png" alt="Kubernetes Pod" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A pod:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can host one or more containers&lt;/li&gt;
&lt;li&gt;Shares the same network namespace and storage volumes among all its containers&lt;/li&gt;
&lt;li&gt;Is ephemeral — meant to be created, run, and replaced automatically when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Principle: “One Container per Pod”
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt;&lt;br&gt;
While Kubernetes supports multiple containers in one pod, the best practice is to run only one container per pod — the single responsibility principle applied at the pod level.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This principle makes each pod act like a microservice unit, cleanly isolated, focused, and independently scalable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "One Container per Pod" in Production?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Isolation of Responsibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each container does one job, which makes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging easier&lt;/li&gt;
&lt;li&gt;Logging cleaner&lt;/li&gt;
&lt;li&gt;Ownership clear (dev vs. ops)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontally scale pods with a single container based on CPU/RAM/load&lt;/li&gt;
&lt;li&gt;Apply pod autoscaling without worrying about co-packaged containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Maintainability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Smaller image sizes&lt;/li&gt;
&lt;li&gt;Easier to test, upgrade, or rollback individually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If one pod/container crashes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only that service is affected, not an entire coupled group&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What If We Don’t Follow It?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|   Problem                    |   Consequence                                                |
| ----------------------------- | ------------------------------------------------------------ |
| Multiple unrelated containers | Tight coupling, hard to debug &amp;amp; test                         |
| Mixed logging/monitoring      | Noisy logs, ambiguous metrics                                |
| Co-dependency                 | Can’t scale or update services independently                 |
| Increased blast radius        | One container failure could affect the whole pod’s operation |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where and How to Use This Principle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When to Use One Container per Pod&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Situation                        | Use One Container? | Why                                             |
| -------------------------------- | ------------------ | ----------------------------------------------- |
| Independent microservices        |   Yes              | Clean design, easy to manage                    |
| Stateless backend or API         |   Yes              | Scalability, fault isolation                    |
| Event-driven consumers           |   Yes              | Simple, lean, retry logic handled by controller |
| Data processors (e.g., ETL jobs) |   Yes              | Lifecycle-bound, logs isolated                  |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to Consider Multiple Containers (with Caution)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Use Case                | Type         | Description                                     |
| ----------------------- | ------------ | ----------------------------------------------- |
| Sidecar (logging/proxy) | Shared Scope | Shares volume or networking with main container |
| Init container          | Pre-start    | Runs setup script before app starts             |
| Ambassador pattern      | Gateway      | Acts as proxy, often combined with service mesh |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Using One Container per Pod&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Benefit            | Description                                 |
| ------------------ | ------------------------------------------- |
| Simpler Deployment | Easy to define and deploy YAML              |
| Easier Monitoring  | Logs and metrics tied to one process        |
| Better DevOps Flow | Aligned with microservice CI/CD pipelines   |
| Container Reuse    | One container image = multiple environments |
| Rolling Updates    | Zero-downtime with **Deployment strategy**  |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Apply It in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes YAML Example (Single Container Pod):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-api
spec:
  containers:
  - name: api-container
    image: my-api-image:v1
    ports:
    - containerPort: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
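&lt;p&gt;To try the manifest above, save it to a file and apply it (the filename is illustrative):&lt;/p&gt;

```shell
# Create the pod and confirm its single container is running
kubectl apply -f my-api-pod.yaml
kubectl get pod my-api
kubectl logs my-api -c api-container
```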



&lt;p&gt;&lt;strong&gt;Best Applied In:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production microservices&lt;/li&gt;
&lt;li&gt;CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Infrastructure-as-Code (IaC) like Helm, Kustomize, Terraform&lt;/li&gt;
&lt;li&gt;Monitoring dashboards (Grafana/Prometheus), because metrics/logs are clean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Aspect                   | Recommendation                                                                  | Reason                   |
| ------------------------ | ------------------------------------------------------------------------------- | ------------------------ |
| **Production Use**       |   Strongly Yes                                                                  | Clean, scalable, secure  |
| **Learning/Dev**         |   Yes                                                                           | Easier to debug and test |
| **Multiple Containers?** | Use only with Sidecar or Init Containers when **tight coupling is intentional** |                          |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Breaking the Rule with Purpose: When to Use Sidecars, Init Containers, and OPA in Kubernetes
&lt;/h2&gt;

&lt;p&gt;In Kubernetes, the principle of "One Container per Pod" is often recommended to maintain simplicity, separation of concerns, and ease of scaling. This approach ensures each pod does one thing well, following the Unix philosophy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ut5ap2trvvh40tsb2i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ut5ap2trvvh40tsb2i7.png" alt="kubernetes" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But let’s face it — production environments are complex.&lt;/p&gt;

&lt;p&gt;There are real-world scenarios where this rule, while solid, becomes a bottleneck. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;What if your application needs a logging agent running alongside it?&lt;/li&gt;
&lt;li&gt;What if you need to perform a setup task before the main container starts?&lt;/li&gt;
&lt;li&gt;What if you need to enforce security policies on what gets deployed?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where Kubernetes-native Pod Patterns come into play. These aren’t workarounds — they’re intentional design features, battle-tested across thousands of production clusters.&lt;/p&gt;

&lt;p&gt;Let’s dive into these patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes Pod Design Patterns: Definitions and Purposes
&lt;/h2&gt;

&lt;p&gt;Before comparing use cases, here’s a quick overview of each pattern.&lt;br&gt;
&lt;strong&gt;Sidecar Container&lt;/strong&gt;&lt;br&gt;
A sidecar is a secondary container in the same pod as your main app. It usually provides auxiliary features like logging, monitoring, service mesh proxies (e.g., Envoy in Istio), or data sync tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: Fluent Bit running as a sidecar to ship logs to a centralized logging system like ELK or CloudWatch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Init Container&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An Init Container runs before your main application container starts. It’s used for tasks that must complete successfully before the app begins, such as waiting for a database to become available or initializing a volume.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: A script that pulls secrets from AWS Secrets Manager before the main app starts.&lt;/p&gt;
&lt;/blockquote&gt;
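
&lt;p&gt;That database-wait case can be sketched as a minimal manifest. This is an illustrative sketch, not part of the original setup: the &lt;code&gt;busybox&lt;/code&gt; image, the &lt;code&gt;bank-db&lt;/code&gt; host, and port 5432 are assumed placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: balance-app
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      # Blocks until the database answers on its port;
      # the main container starts only after this exits successfully.
      command: ['sh', '-c', 'until nc -z bank-db 5432; do sleep 2; done']
  containers:
    - name: balance-app
      image: mybank/balance-check:v1
      ports:
        - containerPort: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;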

&lt;p&gt;&lt;strong&gt;Multi-Container Pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, you need multiple containers to share resources (like volumes or network namespaces) and work tightly together as a single unit. Kubernetes allows this via multi-container pods. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: An app + proxy architecture, where a caching proxy like NGINX shares a volume with the app to serve cached assets.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;OPA (Open Policy Agent)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OPA is not a pod pattern, but a Kubernetes-integrated policy engine. It runs as an admission controller and evaluates policies before allowing workloads to be deployed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: Prevent deploying pods that run as root or don’t have resource limits defined.&lt;/p&gt;
&lt;/blockquote&gt;
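
&lt;p&gt;Rules like those can be written as a short Rego sketch. This is an illustration under assumptions: the field paths follow the plain OPA admission webhook's &lt;code&gt;AdmissionReview&lt;/code&gt; input, and the deny messages are made up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package kubernetes.admission

# Deny pods whose containers explicitly run as root.
deny[msg] {
  input.request.kind.kind == "Pod"
  input.request.object.spec.containers[_].securityContext.runAsUser == 0
  msg := "Containers must not run as root"
}

# Deny pods with any container that has no resource limits defined.
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.resources.limits
  msg := "Every container must define resource limits"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;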
&lt;h2&gt;
  
  
  Use Case Comparison: When to Use What?
&lt;/h2&gt;

&lt;p&gt;Let’s break it down by use case and see where single-container pods hold up — and where patterns like sidecars and init containers are essential.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|     Use Case                            |   Single Container |   Multi-Container (Sidecar/Init) |   Why / Benefit                                              |
| --------------------------------------- | ------------------- | --------------------------------- | ------------------------------------------------------------- |
| Simple microservice                     |   Recommended       |   Not Needed                     | Keeps it simple, isolated, and independently scalable         |
| App + logging agent                     |   Lacking          |   Use Sidecar                     | Sidecar can ship logs (e.g., Fluent Bit, Promtail) separately |
| Pre-start setup (e.g., init database)   |   Not Possible     |   Use Init Container              | Guarantees setup tasks are done before app runs               |
| App with service mesh                   |   Not Enough       |   Sidecar (e.g., Envoy)           | Enables traffic control, mTLS, tracing via proxies            |
| Deploy policy enforcement               |   With OPA Hook     |   OPA Enforced                   | Prevents insecure or non-compliant pod specs                  |
| App needs config from secrets manager   |   Missing Logic    |   Init or Sidecar                 | Pull secrets securely before runtime                          |
| Application needs tightly coupled logic |   Better Separate  |   Can Use                        | Only if logic cannot be decoupled (e.g., proxy + app pair)    |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Trade-Offs You Should Be Aware Of
&lt;/h2&gt;

&lt;p&gt;While these patterns are powerful, they're not always the right answer. Some trade-offs include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Pattern             | Pros                                                | Cons                                                                   |
| ------------------- | --------------------------------------------------- | ---------------------------------------------------------------------- |
| **Sidecar**         | Modular, reusable, supports observability           | Resource sharing, lifecycle management complexity                      |
| **Init Container**  | Clean init logic, enforces sequence                 | Adds to pod startup latency                                            |
| **Multi-Container** | Co-location simplifies some tightly coupled tasks   | Harder to scale independently, debugging more complex                  |
| **OPA**             | Declarative policy control, security at deploy time | Learning curve, requires policy writing and admission controller setup |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Should You Use These Patterns?
&lt;/h2&gt;

&lt;p&gt;Yes — when the use case justifies it.&lt;/p&gt;

&lt;p&gt;These patterns exist to make Kubernetes production-grade. But like any engineering decision, use them with intent, not out of trend.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: Start with “One Container per Pod.” Break that rule only when there's a clear, repeatable reason — like logging, proxying, initialization, or security enforcement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Real-World Scenario: Banking App Traffic Spike
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42by65c5a2b0kgbkx26n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42by65c5a2b0kgbkx26n.png" alt="Kubernetes monolithic pod overload" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Problem:&lt;br&gt;
A banking application has high traffic for balance inquiries. Customers now also want OTPs and transaction alerts quickly. The app starts failing under load because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging service can’t keep up&lt;/li&gt;
&lt;li&gt;OTP system has race conditions&lt;/li&gt;
&lt;li&gt;Application startup is unreliable during node restarts&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Before — Problematic YAML (One Container per Pod)
&lt;/h2&gt;

&lt;p&gt;balance-app.yaml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: balance-app
spec:
  containers:
    - name: balance-app
      image: mybank/balance-check:v1
      ports:
        - containerPort: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Before — Problematic Architecture (One Container, Poor Setup)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────┐
│   User (Mobile/Web)    │
└────────┬───────────────┘
         │
         ▼
 ┌────────────────────────────┐
 │   balance-app Pod (v1)     │  ← Monolithic (One container handles everything)
 └────────────────────────────┘
         │
         ├── OTP call → Fails if OTP pod isn't ready
         └── Internal logs → Lost on crash

     🔻 Problems:
     - No readiness probe
     - No log persistence
     - OTP/MS failures break flow
     - Crash = no trace/debug

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s going wrong?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Component           | Issue                                                                                    |
| ------------------- | ---------------------------------------------------------------------------------------- |
|   Logging          | Logs are stored inside the container. Once the container crashes, logs are lost.         |
|   OTP Microservice | It runs in a separate pod. If OTP service isn't ready, the balance-app fails to connect. |
|   No Retry or Wait | There’s no mechanism to wait for OTP or DB to be ready before app starts                 |
|   Hard to Debug     | You can’t access logs post-mortem or know if failures happened at startup                |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisya6dn88c731q2ekw20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisya6dn88c731q2ekw20.png" alt="Resilient kubernete pod design" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  After — Improved YAML (Still One Container per Pod, But Better)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;You don’t use Kubernetes patterns yet, but you improve by:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mounting logs to a persistent volume&lt;/li&gt;
&lt;li&gt;Adding readiness probes to avoid traffic until app is ready&lt;/li&gt;
&lt;li&gt;Managing environment variables for external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;balance-app-fixed.yaml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: balance-app
spec:
  containers:
    - name: balance-app
      image: mybank/balance-check:v2
      ports:
        - containerPort: 8080
      env:
        - name: OTP_SERVICE_URL
          value: "http://otp-service:9090"
        - name: DB_HOST
          value: "bank-db"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      volumeMounts:
        - name: logs
          mountPath: /app/logs
  volumes:
    - name: logs
      emptyDir: {}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After — Improved One-Container Setup (Same Pattern, Better Practice)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────┐
│   User (Mobile/Web)    │
└────────┬───────────────┘
         │ HTTPS/API Call
         ▼
 ┌──────────────────────────────┐
 │ balance-app Pod (v2)         │  ← Still 1 container, but:
 │ - readinessProbe             │
 │ - env vars for OTP &amp;amp; DB      │
 │ - volumeMounts for logs      │
 └────────┬───────────────┬─────┘
          │               │
          │               ▼
          │        ┌──────────────┐
          │        │ OTP Service  │  ← Separate Pod
          │        └──────────────┘
          ▼
 ┌──────────────────────────────┐
 │ Logs Persisted (emptyDir)    │  ← Logs retained on crash
 └──────────────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the Fixes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Fix                          | Purpose                                                           |
| ---------------------------- | ----------------------------------------------------------------- |
| `readinessProbe`             | Prevents traffic until the app is actually ready                  |
| `env` variables for OTP &amp;amp; DB | External service URLs are injected in a clean, reusable way       |
| `volumeMounts` + `emptyDir`  | Stores logs outside container file system (won’t vanish on crash) |
| Still One Container          | Yes. No sidecar or init yet – just improved hygiene               |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Version            | Containers | Resiliency | Logging               | Service Coordination      | Deployment Quality |
| ------------------ | ---------- | ---------- | --------------------- | ------------------------- | ------------------ |
|   Before           | 1          | Poor       | Volatile              | Manual, error-prone       | Naïve              |
|   After (Improved) | 1          | Medium     | Volatile but retained | Structured via ENV/Probes | Good baseline      |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqakybd6iq9pw26xv3so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqakybd6iq9pw26xv3so.png" alt="multi containers in pod" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Scenario Revisited: Solving the Spike with Pod Patterns
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A banking application runs as a &lt;strong&gt;single-container Pod&lt;/strong&gt; serving all responsibilities: balance check, OTP, logs, alerts.&lt;/li&gt;
&lt;li&gt;During traffic spikes (e.g., salary day, multiple users checking balance + receiving OTP), performance drops.&lt;/li&gt;
&lt;li&gt;OTPs are delayed, logs are dropped, CPU spikes — causing user frustration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BEFORE: Problematic YAML (Monolithic Pod Pattern)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ismailcorp/banking-app:latest&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OTP_ENABLED&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LOGGING_ENABLED&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What’s Wrong Here?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monolithic Design&lt;/td&gt;
&lt;td&gt;OTP, logging, app logic all bundled together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No Scalability&lt;/td&gt;
&lt;td&gt;Can't scale OTP or logging independently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU Bottlenecks&lt;/td&gt;
&lt;td&gt;OTP spikes during traffic cause app logic to slow down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard to Audit&lt;/td&gt;
&lt;td&gt;Logs generated internally, no audit separation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No Policy Controls&lt;/td&gt;
&lt;td&gt;No security policy (e.g., secrets as env vars, no resource governance)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  AFTER: Kubernetes Pattern Approach (Sidecar + Separate Deployments + OPA)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Deployment&lt;/code&gt;: banking-app (business logic)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-service&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ismailcorp/banking-app:latest&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300m"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;Deployment&lt;/code&gt;: otp-service (Ambassador/Adapter Pattern)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-service&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-service&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-generator&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ismailcorp/otp-service:latest&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;150m"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;Sidecar&lt;/code&gt;: log-agent (Sidecar Pattern for Audit Logging)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app-with-logger&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-service&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ismailcorp/banking-app:latest&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-sidecar&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fluent/fluentd:latest&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/log/app&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs&lt;/span&gt;
      &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;OPA Policy&lt;/code&gt;: Enforce Sidecar and No Plaintext Secrets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Pod"&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"log-sidecar"&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"Missing audit logging sidecar"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Pod"&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"SECRET_KEY"&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"Secrets should be mounted as volumes, not set as ENV"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is enforced using &lt;strong&gt;OPA Gatekeeper&lt;/strong&gt;, which registers with the Kubernetes API server as a validating admission webhook.&lt;/p&gt;
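
&lt;p&gt;With Gatekeeper specifically, raw Rego like the block above is packaged inside a &lt;code&gt;ConstraintTemplate&lt;/code&gt; rather than loaded directly. A minimal sketch under assumptions: the template name &lt;code&gt;k8srequirelogsidecar&lt;/code&gt; is illustrative, and Gatekeeper exposes the reviewed object under &lt;code&gt;input.review&lt;/code&gt; instead of &lt;code&gt;input.request&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequirelogsidecar
spec:
  crd:
    spec:
      names:
        kind: K8sRequireLogSidecar
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequirelogsidecar

        has_log_sidecar {
          input.review.object.spec.containers[_].name == "log-sidecar"
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "Pod"
          not has_log_sidecar
          msg := "Missing audit logging sidecar"
        }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;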

&lt;h3&gt;
  
  
  &lt;code&gt;Service&lt;/code&gt;: For App &amp;amp; OTP
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banking-app&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otp-service&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;81&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Did We Fix?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix Area&lt;/th&gt;
&lt;th&gt;Before (Monolithic)&lt;/th&gt;
&lt;th&gt;After (K8s Patterns + OPA)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Single Container Pod&lt;/td&gt;
&lt;td&gt;Modular Pods with Sidecars and Deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Built-in, hard to audit&lt;/td&gt;
&lt;td&gt;Fluentd Sidecar, independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTP Handling&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Separate OTP service (scalable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Entire pod only&lt;/td&gt;
&lt;td&gt;OTP and app scale independently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Policies&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Enforced via OPA (e.g., no plain secrets)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance (e.g., PCI-DSS)&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Strong audit and policy governance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion: Which Kubernetes Pod Approach Should You Use?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with One Container per Pod
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cleanest, simplest microservices structure&lt;/li&gt;
&lt;li&gt;Easy to debug, scale, and monitor&lt;/li&gt;
&lt;li&gt;Ideal for small services, MVPs, or early-stage architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evolve to Kubernetes Pod Patterns as Needed
&lt;/h3&gt;

&lt;p&gt;As complexity grows (especially in fintech, healthcare, or e-commerce), &lt;strong&gt;adopt design patterns&lt;/strong&gt; to solve real-world needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sidecars&lt;/strong&gt; → Add observability (e.g., logging, tracing), service mesh, proxies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Init Containers&lt;/strong&gt; → Ensure startup order, handle configuration or bootstrapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPA/Gatekeeper Policies&lt;/strong&gt; → Enforce security, compliance, and governance&lt;/li&gt;
&lt;/ul&gt;
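<p>As a sketch of the init-container pattern, the Pod below blocks app startup until the OTP service (defined in the Service manifests above, listening on port 81) is reachable. The &lt;code&gt;busybox&lt;/code&gt; image, the &lt;code&gt;nc&lt;/code&gt; probe, and the app image tag are illustrative assumptions:</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: banking-app
  labels:
    app: banking-app
spec:
  initContainers:
    # Runs to completion before the main container starts
    - name: wait-for-otp
      image: busybox:1.36
      command: ["sh", "-c", "until nc -z otp-service 81; do echo waiting for otp-service; sleep 2; done"]
  containers:
    - name: banking-app
      image: banking-app:latest
      ports:
        - containerPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;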

&lt;h3&gt;
  
  
  In High-Stakes Environments (Fintech, Regulated E-Commerce): Use a Hybrid Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;single-container pods&lt;/strong&gt; for most microservices&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;sidecars/init containers&lt;/strong&gt; when functionality justifies it&lt;/li&gt;
&lt;li&gt;Enforce policy-driven controls with &lt;strong&gt;OPA&lt;/strong&gt; or &lt;strong&gt;Kyverno&lt;/strong&gt; to meet compliance (e.g., PCI-DSS, RBI, HIPAA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trade-off Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Trade-Offs / Complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One Container per Pod&lt;/td&gt;
&lt;td&gt;Simplicity, scalability&lt;/td&gt;
&lt;td&gt;No in-pod helpers; cross-cutting concerns need other patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Patterns (Sidecar, Init)&lt;/td&gt;
&lt;td&gt;Logging, proxying, workflows&lt;/td&gt;
&lt;td&gt;More YAML, more coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA/Gatekeeper Policies&lt;/td&gt;
&lt;td&gt;Compliance, audit, guardrails&lt;/td&gt;
&lt;td&gt;Requires policy authoring skills&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Final Advice:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Build simple&lt;/strong&gt;, &lt;strong&gt;evolve with patterns&lt;/strong&gt;, and &lt;strong&gt;secure with policy&lt;/strong&gt; as your system matures.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
