Rakesh Tanwar

Common Mistakes Enterprises Make with Cloud Storage and How to Avoid Them

Over and over, I see big enterprises burn money, tank performance, or create compliance nightmares because they treat cloud storage like a magic infinite disk. It isn’t. It’s a toolbox. And if you use a hammer for everything, eventually you’re going to hit your thumb. Here are the most common mistakes I see, and how I’d avoid them if I were rebuilding from scratch.

1. Treating cloud storage like an on-prem SAN
The classic one: “We moved to the cloud, so we provisioned giant network volumes and mounted them everywhere. Done.”

That’s not “cloud,” that’s your old data center with extra steps.

Block storage has its place (databases, certain legacy apps), but:

  • It doesn’t scale like object storage
  • It’s usually more expensive at large capacity
  • It ties data to specific instances and zones

What I do instead
I start with object storage as the default for anything that is:

  • Shared across teams
  • Read-heavy
  • Long-lived

Block storage is reserved for latency-sensitive, tightly coupled workloads. If I catch myself putting “everything” on block storage, that’s my red flag that I’m just re-implementing the old world in the cloud.

2. Keeping everything in the hottest (most expensive) tier
I once reviewed a storage bill for an enterprise where 90%+ of the data hadn’t been touched in over a year—all sitting in premium “hot” storage. Their monthly bill was basically a museum ticket for data nobody visited.

This happens because:

  • Nobody owns lifecycle policies.
  • “We’ll clean it up later” quietly becomes “never.”
  • Teams are afraid of archive tiers because they don’t trust they can get data back.

How to avoid it:

  • Classify data into hot / warm / cold / archive.
  • Put automated lifecycle policies on every bucket by default:
    • After X days → cool tier
    • After Y days → archive or delete
  • Only exempt datasets where you actively justify why they must stay hot.

My rule: if no one can name a reason a dataset must be hot within 5 seconds, it probably shouldn’t be.
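
As a concrete sketch, this is roughly what a default lifecycle policy can look like with boto3 against an S3-style store. The bucket name and the 30/180-day thresholds are placeholders standing in for X and Y above, not a recommendation for your data.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; 30 and 180 days stand in for the X/Y thresholds above.
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "default-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # cool tier
                    {"Days": 180, "StorageClass": "GLACIER"},     # archive
                ],
            }
        ]
    },
)
```

The same default can live in a Terraform module so that new buckets inherit it automatically, instead of waiting for someone to remember after the bill arrives.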

3. Ignoring egress and API costs
Everyone obsesses over “$ per GB per month” and then gets ambushed by:

  • Cross-region egress
  • “Chatty” apps making millions of small GET/PUTs
  • Constant re-downloading of the same objects

I’ve seen GPU training jobs where the storage API bill rivaled the compute bill because the data loader was pulling tiny objects one by one across regions.

How I avoid this:

  • Co-locate compute and storage in the same region by default.
  • For high-I/O workloads, shard small files into larger objects (webdataset, tar, parquet, etc.).
  • Use caching:
    • Local NVMe or node-local SSDs as a read-through cache for frequently accessed datasets.
  • Set up cost dashboards that actually surface:
    • Top egress sources
    • Top buckets by API requests

If you don’t measure egress and API calls, you’ll be surprised. And cloud surprise is always expensive.
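
To make the "shard small files into larger objects" bullet concrete, here is a minimal sketch that packs a directory of tiny files into roughly 1 GiB tar shards and uploads each shard as one object via boto3. The directory, bucket name, and shard size are assumptions you would tune per workload.

```python
import os
import tarfile
import boto3

SHARD_SIZE = 1 << 30                  # ~1 GiB per shard; tune per workload
SRC_DIR = "data/small_files"          # hypothetical local directory of tiny files
BUCKET = "training-data-example"      # hypothetical bucket in the compute region

s3 = boto3.client("s3")

def flush(tar):
    """Close the current shard and upload it as a single large object."""
    tar.close()
    s3.upload_file(tar.name, BUCKET, f"shards/{os.path.basename(tar.name)}")

shard_idx, shard_bytes = 0, 0
tar = tarfile.open(f"shard-{shard_idx:05d}.tar", "w")

for name in sorted(os.listdir(SRC_DIR)):
    path = os.path.join(SRC_DIR, name)
    size = os.path.getsize(path)
    # Start a new shard once the current one is big enough.
    if shard_bytes and shard_bytes + size > SHARD_SIZE:
        flush(tar)
        shard_idx += 1
        shard_bytes = 0
        tar = tarfile.open(f"shard-{shard_idx:05d}.tar", "w")
    tar.add(path, arcname=name)
    shard_bytes += size

flush(tar)  # upload the final, possibly smaller shard
```

The data loader then streams a handful of large sequential objects instead of issuing millions of tiny cross-region GETs.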

4. No data locality strategy for performance-critical workloads
From the GPU side, this one hurts the most.

I’ve seen enterprises deploy multi-million-dollar GPU clusters, then point them at data sitting:

  • In another region
  • In another cloud
  • On a sad NFS box hidden behind a VPN

Then they wonder why GPU utilization is 40%.

My rule
For performance-sensitive jobs (training, large-scale analytics, latency-sensitive inference):

  • Data and compute must live as close as physically possible.
  • For big training workloads:
    • Keep canonical data in object storage in the same region.
    • Stage active shards onto local NVMe before the job starts.
  • For critical real-time inference:
    • Keep models and key features on local SSD / high-performance block.

If you’re paying for high-end GPUs, it’s almost always cheaper to over-provision fast storage than to let those GPUs idle waiting for bytes.
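
A minimal sketch of the "stage active shards onto local NVMe" step, assuming an S3-style bucket readable via boto3 and a node-local NVMe mount at /mnt/nvme. All names and paths are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import boto3

BUCKET = "training-data-example"   # hypothetical bucket in the same region as the GPUs
PREFIX = "shards/"                 # only the shards this job actually needs
LOCAL_DIR = "/mnt/nvme/staged"     # assumed node-local NVMe mount

s3 = boto3.client("s3")
os.makedirs(LOCAL_DIR, exist_ok=True)

def stage(key: str) -> str:
    """Download one shard to local NVMe, skipping files already present."""
    dest = os.path.join(LOCAL_DIR, os.path.basename(key))
    if not os.path.exists(dest):          # crude read-through cache
        s3.download_file(BUCKET, key, dest)
    return dest

# List every shard under the prefix.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Pull shards in parallel so staging doesn't become the new bottleneck.
with ThreadPoolExecutor(max_workers=16) as pool:
    local_paths = list(pool.map(stage, keys))

print(f"Staged {len(local_paths)} shards to {LOCAL_DIR}")
```

Run this before the training job launches; the skip-if-present check means re-runs on the same node only pull what is missing.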

5. Over-sharing and under-governing buckets
Another common pattern: one giant “data” bucket with:

  • Broad access
  • Flat structure
  • Ad hoc naming
  • No clear ownership

It works fine until:

  • Someone deletes a folder they shouldn’t.
  • An internal tool exposes data it shouldn’t.
  • Nobody knows who can approve access because “everyone uses that bucket.”

How I handle it:

  • Design for data domains, not “one bucket to rule them all”: analytics-, ml-, raw-, archive-, etc.
  • Assign clear ownership per bucket/domain:
    • Data owner
    • Access policy owner
    • Lifecycle policy owner
  • Use least-privilege IAM:
    • Read-only where possible
    • Narrow write permissions
    • Strong separation between production and experiment buckets

Security teams love this. So do auditors. But more importantly, it reduces accidents.
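
As one way to encode least privilege, here is a sketch that attaches a read-only, prefix-scoped inline policy to a role with boto3's IAM client. The role name, bucket, and prefix are hypothetical.

```python
import json
import boto3

iam = boto3.client("iam")

# Read-only access to a single prefix in a single bucket: no writes, no deletes.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # listing is scoped to the reports/ prefix only
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::analytics-example-bucket",
            "Condition": {"StringLike": {"s3:prefix": "reports/*"}},
        },
        {   # object reads only
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::analytics-example-bucket/reports/*",
        },
    ],
}

iam.put_role_policy(
    RoleName="analytics-readers",               # hypothetical role
    PolicyName="analytics-reports-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)
```

Write access would be a separate, narrower policy attached only to the pipeline that actually produces that data.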

6. No versioning, no backups, no restore tests
This is the quiet killer.

I still see critical buckets with:

  • Versioning turned off
  • No backup or replication strategy
  • No tested restore process

Then one day, a bad script runs rm -rf in the wrong prefix, and suddenly everyone discovers that “11 nines of durability” doesn’t mean “undo button.”

My practical approach

  • Turn on versioning for:
    • Any bucket storing production models, configs, or critical reference data.
  • Have a clear replication / backup story:
    • Cross-region replication for “if this region dies, we’re in trouble” datasets.
    • Separate “backup projects/accounts” to isolate from accidental deletion.
  • Actually test restores:
    • Pull a random dataset from backup.
    • Time how long it takes and what breaks.

If you’ve never practiced a restore, assume it doesn’t work.
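
A minimal sketch of the first and last bullets, assuming an S3-style store via boto3. The bucket and key are hypothetical, and a real restore drill should pull a whole dataset, not a single file.

```python
import time
import boto3

s3 = boto3.client("s3")

# 1. Turn on versioning for a critical bucket (hypothetical name).
s3.put_bucket_versioning(
    Bucket="prod-models-example",
    VersioningConfiguration={"Status": "Enabled"},
)

# 2. A crude restore drill: pull an object back from the backup bucket
#    and time it, so "restore" is a measured number rather than a hope.
start = time.time()
s3.download_file(
    "prod-models-backup-example",   # hypothetical backup bucket
    "models/churn/v12/model.bin",   # hypothetical key
    "/tmp/restore-drill-model.bin",
)
print(f"Restore drill took {time.time() - start:.1f}s")
```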

7. Letting everyone do “whatever they want” forever
Some chaos is healthy. But I’ve worked with enterprises where every team:

  • Invents their own folder structure
  • Chooses random storage classes
  • Builds slightly different ingestion pipelines

On day one, this feels like “autonomy.” By year two, it’s data hell.

What I recommend:

  • Create a small set of storage patterns:
    • “Analytics dataset pattern”
    • “ML training dataset pattern”
    • “Archive pattern”
  • Provide templates and tooling:
    • Terraform modules, bucket naming conventions, lifecycle defaults.
  • Allow deviations—but make them explicit decisions, not accidents.

The goal isn’t central control for its own sake. It’s to avoid having 20 ways to do the same thing, all slightly broken in different ways.
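
As a small example of that tooling, here is a sketch of a bucket-naming check that could run in CI before infrastructure is applied. The domain-env-purpose convention shown is an assumption, not a standard; the point is that the convention is enforced by a machine, not by hope.

```python
import re

# Hypothetical convention: <domain>-<env>-<purpose>, e.g. "ml-prod-training-shards".
BUCKET_NAME_PATTERN = re.compile(
    r"^(analytics|ml|raw|archive)-(prod|staging|dev)-[a-z0-9-]+$"
)

def check_bucket_name(name: str) -> bool:
    """Return True if the bucket name follows the agreed pattern."""
    return bool(BUCKET_NAME_PATTERN.match(name))

assert check_bucket_name("ml-prod-training-shards")
assert not check_bucket_name("bobs_stuff_final_v2")
```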

Bringing it together
When I walk into an enterprise as a cloud GPU person, I’ve learned not to start by asking “what GPUs are you using?” I start with:

  • Where does your data live?
  • Who owns which buckets?
  • What are your lifecycle policies?
  • How often do you move or restore data?

Most “GPU performance issues” I see are really storage design issues in disguise.

If you treat cloud storage as a strategic system (classify data, control access, manage lifecycle, test restores, and care about locality), you’ll get better security, lower bills, and much happier GPUs.
