errorbudget

Posted on Jun 9 • Originally published at errorbudget.io

The AI memory crunch: how DRAM and NAND price shocks reshape infrastructure budgets

#infrastructure #devops #hardware #ai

Something significant is happening to the cost structure of enterprise infrastructure, and it is not getting enough attention in IT planning conversations.

DDR5 server memory prices have tripled or quadrupled over the past year. Enterprise NVMe SSDs have seen even more dramatic moves — some 30TB TLC drives went from around $3,000 to over $17,000 in nine months. NAND wafer spot prices climbed roughly 9x from mid-2025 levels. And memory manufacturers have publicly stated they are sold out through 2026, with no meaningful new capacity arriving before late 2027.

The cause is straightforward: AI infrastructure buildout is consuming a disproportionate share of memory and storage manufacturing capacity, leaving enterprise buyers competing for what remains.

The consequence is that infrastructure budgets built in 2024 are no longer valid for 2026 procurement. Refresh cycles, capacity expansions, and even routine maintenance face cost pressures that did not exist 18 months ago. Teams I talk to are quietly absorbing 30-60% line item increases on hardware that used to be predictable commodity purchases.

This article documents what is happening, why it matters operationally for infrastructure teams, and the procurement strategies that have helped us navigate the current market. The data points are public; the operational responses are from our own experience.

What the numbers actually look like

Before getting to strategy, the magnitude needs to be clear. The price movements are not within normal market volatility — they reflect structural reallocation of memory supply globally.

DRAM (server memory)

Multiple data sources tell a consistent story:

32GB DDR5 modules: Samsung raised list prices from roughly $149 to $239 in late 2025, a 60% increase, then again in subsequent quarters
64GB DDR5 RDIMMs (the workhorse of enterprise servers): Counterpoint Research projected prices could double from early 2025 levels by end of 2026
DDR5 contract pricing: Climbed 50% in 2025, projected another 30% in Q4 2025, plus 20% more in early 2026
DDR4 memory: Caught up to DDR5 pricing as supply got reallocated. 32GB DDR4 kits went from $60-90 to $150-180 in roughly six months
TrendForce Q2 2026 forecast: Server DRAM contract prices expected to rise again quarter-over-quarter
SK Hynix: Reported in late 2025 earnings that HBM, DRAM, and NAND capacity was sold out through 2026
Micron: Stopped quoting some products entirely and told its main customers it could satisfy only 55-60% of their demand

For a typical 2U server with 16 DIMM slots running 64GB modules (1TB total memory), the memory bill alone has moved from roughly $8,000-10,000 in early 2025 to $20,000+ in early 2026 — and is still rising.
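The per-server arithmetic above can be reproduced with a short Python sketch. The dollar inputs are the rough ranges quoted in this article, not vendor quotes, and the function name `implied_price_per_dimm` is only illustrative.

```python
# Back-of-the-envelope check of the 2U server memory bill quoted above.
# All dollar inputs are the article's rough ranges, not vendor quotes.

DIMM_SLOTS = 16
DIMM_SIZE_GB = 64


def implied_price_per_dimm(total_bill_usd: float, slots: int = DIMM_SLOTS) -> float:
    """Return the per-module price implied by a whole-server memory bill."""
    return total_bill_usd / slots


total_gb = DIMM_SLOTS * DIMM_SIZE_GB  # 1024 GB, i.e. 1 TB

early_2025 = [implied_price_per_dimm(bill) for bill in (8_000, 10_000)]
early_2026 = implied_price_per_dimm(20_000)

print(f"Total memory: {total_gb} GB")
print(f"Early 2025: ${early_2025[0]:,.0f}-${early_2025[1]:,.0f} per 64GB module")
print(f"Early 2026: ${early_2026:,.0f}+ per 64GB module")
print(f"Multiplier: {20_000 / 10_000:.1f}x to {20_000 / 8_000:.1f}x")
```

Dividing the bill by slot count makes it easier to compare a whole-server quote against the per-module prices that analysts report.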

NAND (storage)

The storage side is even more dramatic:

30TB enterprise TLC SSDs: $3,062 in Q2 2025, $17,500 in Q1 2026 — a 472% increase
8TB NVMe SSDs: Some configurations exceeded $1,400 retail, which works out to more per gram than gold
NAND wafer spot prices: Climbed roughly 9x from mid-2025 levels
TrendForce Q1 2026 data: Client SSD contract prices increased at least 40% quarter-over-quarter
Kingston: Reported 246% increase in NAND wafer costs
Western Digital: CEO confirmed the company is completely sold out for 2026
Q2 2026 forecast: NAND Flash contract prices expected to rise another 70-75% QoQ
Micron exited the Crucial consumer brand entirely in late 2025 to redirect all production toward AI and enterprise customers

For our environment, this means a vSAN ESA cluster build that we estimated at $400K of storage in 2025 now costs $700-900K — without any change in design.
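As a quick sanity check on the storage figures above, the sketch below recomputes the percentage increase for the 30TB drive, its cost per TB, and the range implied by the vSAN ESA estimate. The helper name `pct_increase` is illustrative; the inputs are the numbers quoted in this section.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100


# 30TB enterprise TLC SSD, prices as quoted in the list above
ssd_q2_2025, ssd_q1_2026, ssd_tb = 3_062, 17_500, 30

print(f"30TB SSD increase: {pct_increase(ssd_q2_2025, ssd_q1_2026):.0f}%")
print(f"Cost per TB: ${ssd_q2_2025 / ssd_tb:,.0f} -> ${ssd_q1_2026 / ssd_tb:,.0f}")

# vSAN ESA storage estimate: $400K in 2025, $700-900K now
low = pct_increase(400_000, 700_000)
high = pct_increase(400_000, 900_000)
print(f"Cluster storage increase: {low:.0f}%-{high:.0f}%")
```

Expressing the change per TB rather than per drive makes it easier to compare against quotes for other capacities.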

HDD (still relevant for archive tiers)

Even traditional spinning hard drives are affected because the AI buildout includes massive cold storage requirements:

HDD lead times stretched from 8-12 weeks to 20-30 weeks
Pricing per TB up 30-50% year over year
Major manufacturers (Seagate, WD, Toshiba) reporting hyperscaler allocation taking majority of high-capacity drives

The cascade effect: AI workloads need fast SSDs for active datasets, but they also need cheap HDDs for archive. Both segments are constrained.

Why this is happening: the AI infrastructure pull

The supply-side explanation is straightforward. Memory and storage manufacturers are responding rationally to demand signals.

HBM economics dominate

High-bandwidth memory (HBM) is the most profitable memory product in the industry right now. It is required for NVIDIA H100, H200, B100, B200, and similar AI accelerators. Each of these accelerators ships with a large amount of HBM on the package rather than conventional DRAM.

Industry analysis suggests HBM consumes 3x the wafer capacity per gigabyte compared to standard DDR5. Manufacturers have made the calculation: produce HBM (60%+ gross margins) instead of consumer DRAM (margin pressure). The wafer reallocation is global and structural.

Samsung Q1 2026 earnings showed 755% profit growth, with 95% from memory. The financial signal to memory manufacturers is unambiguous: prioritize AI customers.

Enterprise SSD demand from hyperscalers

According to public sources, Microsoft Azure and AWS each bought more than 500,000 SSDs per quarter in 2025 to feed AI inference clusters. IDC reported the worldwide server market grew 97.3% in spending in Q2 2025 alone.

Hyperscalers buy in volume with long-term agreements. Manufacturers prioritize these customers. Enterprise SSD now represents about 60% of global NAND production by value, up from a much lower historical share.

The remainder of the NAND market — consumer SSDs, enterprise buyers who are not hyperscalers, embedded applications — competes for what is left.

Supply expansion is years away

New memory fabs take 3-5 years to build and bring online, so capacity announced tomorrow would not help in this cycle. Meaningful supply relief is not expected until 2027-2028 at the earliest, and only from expansions already under way.

Major manufacturers have explicitly not announced aggressive expansion:

Samsung's wafer output is decreasing from 4.9 million (2025) to 4.7 million (2026) as production shifts to HBM
SK Hynix output decreasing from 1.9 million to 1.7 million wafers
Micron focusing on AI and enterprise, not consumer
Chinese manufacturers (YMTC, CXMT) expanding but not at full capacity until 2027-2028

The Big 3 memory manufacturers are choosing to maintain high prices rather than chase volume. From their perspective, this is a "memory super-cycle" to be milked, not a market to be flooded.

What this means operationally for infrastructure teams

The price increases affect operational decisions across our environment. Here is what we have seen change.

Refresh cycles delayed

The standard 4-year refresh cycle for compute and storage is being re-evaluated. If a hardware refresh costs 40-60% more than the previous cycle, does it still pencil out? In many cases the answer is "delay another year, extend support contracts, accept some performance trade-off."

Two of our refresh projects planned for Q4 2026 have slipped to Q2 2027. The original budget cannot fund the original specification at current prices.

Specification compromises

Where refresh proceeds, specifications get trimmed:

1TB memory configurations becoming 512GB
8x NVMe per server becoming 4x NVMe + 2x SAS SSD
All-NVMe vSAN ESA plans reverting to hybrid OSA configurations
High-density GPU servers downsized from 8 to 4 GPUs

Each compromise has performance implications. Capacity planning conversations get harder when the cost-per-GB equation has shifted dramatically.

Capacity expansion deferred

Capacity expansion projects that were "buy when we need it" decisions now become "buy 18 months ahead and stockpile" decisions. We are seeing teams pre-purchase inventory they would normally buy just-in-time, simply because availability is uncertain.

This is operationally awkward — sitting on inventory has carrying costs and obsolescence risk — but the alternative is project delays.

Procurement timeline changes

Procurement that took 6-8 weeks now routinely takes 12-16 weeks or longer. Specific memory and storage SKUs may not be quotable at all. Vendors still offer commitments, but with disclaimers about longer lead times.

We have moved budget planning conversations earlier — Q3 planning for next year's procurement, not Q4. The further out we lock in pricing, the more predictable the outcomes.

Cloud cost reconsideration

The cost gap between on-premise and cloud has narrowed. Cloud GPU pricing has remained more stable than the cost per unit of performance of on-premise hardware. Some workloads that we kept on-premise for cost reasons now look much closer to cloud pricing.

We have not migrated significant workloads, but the conversation has shifted from "cloud is too expensive" to "cloud is no longer obviously more expensive than building it ourselves."

Audit and compliance impact

Even compliance-related infrastructure feels the pressure. Our regulatory requirements for retention storage (7-year audit logs, compliance archives) drive HDD purchasing. HDD prices and lead times have pressured those projects too.

We have not changed our compliance approach, but the cost of meeting compliance has gone up materially.

Procurement strategies that have helped

Across multiple procurement cycles in this market, here is what has worked for us.

Multi-year supply agreements

We moved from spot purchasing to long-term supply agreements with key vendors. The deal structure: commit to volume over a 3-year horizon, get price protection against further increases (capped at predefined inflation rates), and accept longer guaranteed lead times in exchange.

This shifts risk from spot price volatility to volume commitment risk. For predictable workloads (banking infrastructure that grows steadily), the trade-off works. For uncertain growth (AI workloads), it requires careful sizing.

The contracts include hardship clauses if our usage drops dramatically. Not perfect insurance, but better than spot market exposure.
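The trade between price risk and volume risk can be illustrated with a small Python sketch. Every number here is hypothetical (starting price, cap, spot path, volumes), and the contract is modeled as simple take-or-pay with the price rising by the full cap each year, which is a simplifying assumption rather than how our agreements are actually written.

```python
def yearly_prices(start_price: float, yearly_changes: list[float]) -> list[float]:
    """Apply successive fractional changes and return the price in each year."""
    prices, price = [], start_price
    for change in yearly_changes:
        price *= 1 + change
        prices.append(price)
    return prices


def total_spend(prices: list[float], units_per_year: int) -> float:
    """Total spend when buying the same number of units every year."""
    return sum(p * units_per_year for p in prices)


# All inputs below are hypothetical, for illustration only.
START_PRICE = 1_000               # unit price today (USD)
CAP = 0.10                        # contractual cap on yearly increases
SPOT_PATH = [0.40, 0.20, -0.15]   # one possible spot-market path
COMMITTED_UNITS = 100             # take-or-pay volume per year

contract_prices = yearly_prices(START_PRICE, [CAP] * 3)  # worst case: cap hit yearly
spot_prices = yearly_prices(START_PRICE, SPOT_PATH)

contract_cost = total_spend(contract_prices, COMMITTED_UNITS)
spot_cost_full = total_spend(spot_prices, 100)     # demand matches the commitment
spot_cost_reduced = total_spend(spot_prices, 60)   # demand drops to 60 units/year

print(f"Contract (100 units committed): ${contract_cost:,.0f}")
print(f"Spot, demand holds at 100:      ${spot_cost_full:,.0f}")
print(f"Spot, demand falls to 60:       ${spot_cost_reduced:,.0f}")
```

Under these made-up inputs the contract wins when demand holds and loses when demand falls well below the commitment, which is why sizing the committed volume matters most for uncertain workloads.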

Vendor diversification

We had concentrated relationships with a small number of OEM partners (Dell, HPE primarily). The current market has rewarded diversification.

We added secondary suppliers for specific components:

Memory: validated sourcing from multiple Tier 1 suppliers
SSDs: qualified alternative vendors for enterprise NVMe
HDDs: split allocation across Seagate and WD

When one supplier hits allocation issues, we have alternatives. The diversification adds operational complexity but reduces single-vendor risk significantly.

Inventory buffer

For critical paths, we now maintain a 6-month buffer inventory rather than buying just-in-time. This applies specifically to memory and SSDs, where shortages are most acute.

Cost: ~$200K-500K of working capital tied up depending on cluster size. Benefit: project predictability and ability to respond to unplanned demand.

This is a meaningful working capital decision. For some organizations, the carrying cost is unacceptable. For us, the operational risk of stockouts (delaying critical projects, missing audit deadlines) justifies the inventory carry.
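A rough way to put a number on the carrying cost is sketched below. The 15% annual carrying rate is purely an assumption for illustration (cost of capital plus storage, insurance, and obsolescence vary widely by organization); the inventory values are the range quoted above.

```python
def annual_carrying_cost(inventory_value: float, carrying_rate: float) -> float:
    """Yearly cost of holding a standing inventory buffer of the given value."""
    return inventory_value * carrying_rate


CARRYING_RATE = 0.15  # hypothetical annual rate, not a figure from our books

costs = {
    value: annual_carrying_cost(value, CARRYING_RATE)
    for value in (200_000, 500_000)
}

for value, cost in costs.items():
    print(f"${value:,} buffer -> about ${cost:,.0f} per year to carry")
```

The comparison that matters is this yearly figure against the estimated cost of a stockout, such as a delayed project or a missed audit deadline.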

Re-baseline storage architecture

We re-evaluated storage choices given new economics:

Hybrid storage (NVMe cache + SAS capacity) regained favor where we had been planning all-NVMe
Tiering strategies reviewed: more cold data on HDD, only hot data on NVMe
Compression and deduplication enabled where we had disabled them for performance
Object storage for archive (S3-compatible on-premise) considered as alternative to NVMe for certain workloads

The result: same workloads, smaller flash footprint, more aggressive tiering. Performance per dollar is the new optimization target, not raw performance.
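To make "performance per dollar" concrete, here is a minimal Python sketch of a blended cost per logical TB across a flash tier and an HDD tier. The flash price per TB is derived from the 30TB drive price quoted earlier in this article; the HDD price, hot-data fraction, and data-reduction ratio are hypothetical placeholders. It ignores RAID overhead and slack space.

```python
def blended_cost_per_tb(
    hot_fraction: float,
    flash_usd_per_tb: float,
    hdd_usd_per_tb: float,
    flash_data_reduction: float = 1.0,
) -> float:
    """Cost per logical TB when hot_fraction of data lives on flash.

    flash_data_reduction is the compression/dedup ratio on the flash tier
    (1.0 means no reduction). Raw media cost only; no RAID or slack overhead.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    flash_cost = hot_fraction * flash_usd_per_tb / flash_data_reduction
    hdd_cost = (1 - hot_fraction) * hdd_usd_per_tb
    return flash_cost + hdd_cost


FLASH_USD_PER_TB = 17_500 / 30   # from the 30TB SSD price quoted in this article
HDD_USD_PER_TB = 25              # hypothetical

all_flash = blended_cost_per_tb(1.0, FLASH_USD_PER_TB, HDD_USD_PER_TB)
tiered = blended_cost_per_tb(0.2, FLASH_USD_PER_TB, HDD_USD_PER_TB,
                             flash_data_reduction=1.5)

print(f"All flash: ${all_flash:,.0f} per TB")
print(f"Tiered:    ${tiered:,.0f} per TB")
```

A real comparison would also have to account for the performance cost of serving cold data from HDD, which this sketch deliberately leaves out.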

💡 Model your storage trade-offs: Use our vSAN capacity calculator to compare OSA hybrid vs ESA all-NVMe sizing for your workload. Account for RAID overhead and slack space before committing to a hardware spec at current prices. Runs in your browser, no signup.

Right-sized memory

For non-AI workloads, we are right-sizing memory configurations more aggressively:

Database servers: 1TB → 768GB where workload allows
VM hosts: review actual memory consumption vs allocation, reclaim where over-allocated
Cache layers: smaller cache + better cache algorithms vs larger cache

These are micro-optimizations individually but add up to meaningful procurement savings.
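The allocation review for VM hosts can be sketched roughly as follows. The VM names, sizes, and the 25% headroom factor are all hypothetical; in practice the peak figures would come from whatever monitoring system you already run.

```python
def reclaimable_gb(allocated_gb: float, peak_used_gb: float,
                   headroom: float = 1.25) -> float:
    """Memory that could be reclaimed while keeping headroom above observed peak."""
    target = peak_used_gb * headroom
    return max(0.0, allocated_gb - target)


# (name, allocated GB, observed peak GB): hypothetical sample data
vms = [
    ("db-01", 512, 300),
    ("app-07", 256, 240),
    ("batch-03", 128, 40),
]

report = {name: reclaimable_gb(alloc, peak) for name, alloc, peak in vms}
total_reclaimable = sum(report.values())

for name, gb in report.items():
    print(f"{name}: {gb:.0f} GB reclaimable")
print(f"Total: {total_reclaimable:.0f} GB")
```

The headroom factor is the judgment call: too low and you risk memory pressure at peak, too high and nothing gets reclaimed.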

Pre-negotiate years 4-5 of new purchases

For new hardware purchases, we now pre-negotiate maintenance and expansion pricing for years 4-5 (beyond the standard support contract). This protects against memory or storage prices being even higher when we eventually expand.

Vendors resist this — they want to keep future pricing flexible. But for large enough commitments, they will agree to caps or formula-based pricing for future expansions.

Used / refurbished hardware

The market for refurbished enterprise hardware has become more interesting. Hardware that is 1-2 generations behind, fully tested and warrantied, is often available at a meaningful discount.

For non-critical workloads (dev/test, certain backup tiers), refurbished can fill the gap. We use this selectively for environments where the support model accepts it.

Capacity planning in a constrained market

The deeper operational change is around capacity planning. The traditional approach was to forecast growth and buy 6-12 months ahead of need. The current market requires more nuance.

Demand-side planning

We have invested more in understanding actual workload growth patterns:

AI workload growth is uncertain — could be 3x or 0.5x year-over-year
Banking workload growth is predictable — typically 8-15% per year
Compliance and archive growth is regulatory — locked-in growth

Each demand segment has a different planning horizon, so the mix calls for a separate procurement strategy per segment.
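The effect of those different growth profiles on a one-year capacity forecast can be sketched as below. The growth ranges for AI and banking are the ones listed above; the current capacities and the fixed 10% compliance growth rate are hypothetical.

```python
# segment -> (current capacity in TB, low growth multiplier, high growth multiplier)
# Capacities and the compliance rate are hypothetical; AI and banking ranges
# follow the figures in the list above.
segments = {
    "ai": (100, 0.5, 3.0),
    "banking": (400, 1.08, 1.15),
    "compliance": (300, 1.10, 1.10),
}

low_total = sum(cap * low for cap, low, _ in segments.values())
high_total = sum(cap * high for cap, _, high in segments.values())

for name, (cap, low, high) in segments.items():
    print(f"{name}: {cap * low:.0f}-{cap * high:.0f} TB next year")
print(f"Total: {low_total:.0f}-{high_total:.0f} TB")

# Share of the forecast uncertainty that comes from the AI segment alone
ai_cap, ai_low, ai_high = segments["ai"]
ai_share = (ai_cap * (ai_high - ai_low)) / (high_total - low_total)
print(f"AI share of forecast spread: {ai_share:.0%}")
```

With these inputs the smallest segment accounts for most of the uncertainty, which is the argument for planning it separately from the predictable ones.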

Supply-side dependencies

We track supply signals from manufacturers:

Earnings calls from Samsung, SK Hynix, Micron
TrendForce and Counterpoint pricing reports
Vendor commentary on lead times
Manufacturer capacity announcements

When a major manufacturer announces capacity constraint, we accelerate that segment of procurement. When supply signals soften, we delay. This is more market awareness than infrastructure teams traditionally maintain, but the current market requires it.

Scenario planning

For each major project, we now run three procurement scenarios:

Best case: Procurement at current quote, no further price increases
Base case: 20-30% price increase between quote and delivery
Stress case: 50%+ price increase or component unavailability requiring redesign

Each scenario gets a budget. The final budget reflects the expected value across scenarios. Projects that fail in the stress case get redesigned upfront.

This is more conservative planning than we used historically. The current market justifies it.
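A minimal sketch of the expected-value calculation follows. The price multipliers mirror the three scenarios above (using 25% as the midpoint of the base case); the probabilities, the quote, and the budget ceiling are hypothetical and would need to be set per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    price_multiplier: float  # delivered price relative to the current quote
    probability: float       # hypothetical weight, must sum to 1 across scenarios


def expected_budget(quote: float, scenarios: list[Scenario]) -> float:
    """Probability-weighted budget across all scenarios."""
    total_p = sum(s.probability for s in scenarios)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(quote * s.price_multiplier * s.probability for s in scenarios)


def failing_scenarios(quote: float, scenarios: list[Scenario],
                      ceiling: float) -> list[str]:
    """Names of scenarios whose cost exceeds the hard budget ceiling."""
    return [s.name for s in scenarios if quote * s.price_multiplier > ceiling]


scenarios = [
    Scenario("best", 1.00, 0.2),
    Scenario("base", 1.25, 0.5),
    Scenario("stress", 1.50, 0.3),
]

quote = 1_000_000      # hypothetical current quote (USD)
ceiling = 1_400_000    # hypothetical maximum the project can absorb

budget = expected_budget(quote, scenarios)
at_risk = failing_scenarios(quote, scenarios, ceiling)

print(f"Expected budget: ${budget:,.0f}")
print(f"Scenarios that break the ceiling: {at_risk}")
```

A project that shows up in the second list is a candidate for redesign before procurement starts, rather than after a quote expires.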

What we expect going forward

Based on industry reporting and our own observations, here is the timeline we are planning against:

2026 (current year)

Memory and SSD prices continue rising through Q2-Q3
Some stabilization possible in Q4 if new capacity comes online
Overall: budget for 30-50% higher hardware costs vs 2024 baselines
Lead times remain extended (10-16 weeks for memory, 12-20 weeks for SSDs)

2027

Modest supply relief expected as new capacity ramps
Pricing likely declines 10-20% from 2026 peak, but not back to 2024 levels
Manufacturers transitioning more capacity to HBM as AI demand continues
Some industries (consumer electronics) see structural cost increases that persist

2028 and beyond

Capacity expansion fully online, supply-demand balance improves
Newly built memory fabs in China contributing meaningful volume
HBM becomes commoditized as competition increases
Standard DDR and NAND pricing stabilizes at "new normal" levels (higher than 2024)

The structural shift is unlikely to fully reverse. AI infrastructure will continue absorbing significant memory and storage capacity. Pricing relief comes from new capacity, not from AI demand collapse.

For infrastructure planning, this means: the new pricing reality is mostly permanent. Procurement strategies that worked before need updating.

What this means for AI infrastructure specifically

A note on the irony: we are operating AI infrastructure that is part of the demand driving up our other procurement costs.

GPU infrastructure pulls supplementary costs

When we deploy NVIDIA H100 GPUs, we also need:

More server memory (large model training requires substantial DDR5)
High-speed local NVMe (training datasets, checkpoints)
HBM (built into the GPU, but consumes the same wafer capacity as our other memory)

The GPU price is the visible cost. The supplementary memory and storage cost is roughly equivalent and growing.

A complete AI training node configuration that was $80K of hardware in 2024 is closer to $130-150K in 2026 for equivalent specs. The GPU is a smaller share of total cost than it used to be.

AI inference becomes more memory-constrained

Inference workloads need fast model loading. That means large memory footprints to keep models resident. As memory costs rise, the economics of running models on-premise vs cloud shift.

We are seeing more teams consider serverless inference patterns, where cloud providers maintain the memory footprint and amortize it across customers. Our own deployments are more selective about which models live in fast memory vs cold storage.

Compliance archives feel the pressure

AI workloads generate large training artifacts, intermediate checkpoints, and model versions. Compliance requires retaining these for audit purposes. The retention storage cost has grown substantially.

We have implemented more aggressive lifecycle policies on AI artifacts — keep only what genuinely needs retention, age out aggressively to cold storage, accept some replay cost vs storing everything.
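The shape of such a lifecycle policy can be sketched in Python as below. This is an illustration of the idea, not our production policy: the `Artifact` fields, the 30-day hot window, and the action names are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Artifact:
    name: str
    created: date
    retention_required: bool  # must be kept for audit purposes
    reproducible: bool        # can be regenerated by replaying the pipeline


def lifecycle_action(artifact: Artifact, today: date, hot_days: int = 30) -> str:
    """Decide where an artifact should live: keep_hot, move_to_cold, or delete."""
    age_days = (today - artifact.created).days
    if age_days <= hot_days:
        return "keep_hot"
    if artifact.retention_required:
        return "move_to_cold"  # retained for audit, but off expensive flash
    if artifact.reproducible:
        return "delete"        # accept the replay cost instead of storing it
    return "move_to_cold"


today = date(2026, 6, 1)
artifacts = [
    Artifact("run-42-latest-checkpoint", date(2026, 5, 20), False, True),
    Artifact("run-17-intermediate-checkpoint", date(2026, 1, 1), False, True),
    Artifact("model-v3-release", date(2025, 1, 1), True, False),
]

actions = {a.name: lifecycle_action(a, today) for a in artifacts}
for name, action in actions.items():
    print(f"{name}: {action}")
```

The deliberate choice here is that only reproducible artifacts without a retention requirement are ever deleted; anything else ages out to cold storage.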

A note on what does not help

Some commonly suggested responses are not actually helping in our experience.

Spot market purchasing

The spot market for memory and SSDs has become more volatile, not less. Price discovery is happening through contracts, not spot. Spot pricing often exceeds contract pricing because little uncommitted supply reaches the spot market.

We have moved away from spot purchasing where possible.

Delaying purchases waiting for price drops

The prevailing analysis is that prices will not drop meaningfully before 2027. Delaying purchases means risking project deadlines while still paying high prices when you eventually buy.

For most workloads, buying in 2026 is more cost-effective than waiting for 2027 hoping for relief.

Switching to consumer-grade alternatives

Some teams have suggested consumer SSDs or memory as cost-saving alternatives. This has not worked for us:

Consumer SSDs lack the endurance and reliability for enterprise workloads
Consumer memory often lacks ECC, problematic for production systems
Audit compliance often requires enterprise-grade components with proper certifications
Support contracts typically require enterprise-validated components

The cost savings are real but the risk is too high for regulated workloads.

What I would recommend to colleagues

For infrastructure operators dealing with this market for the first time:

1. Update budget assumptions for ongoing operations

If your operational budget assumed 2023-2024 hardware prices for refresh and expansion, those budgets need revision. Plan for 30-50% higher line items, with continued pressure through 2026.

2. Have explicit conversations with finance

Finance teams may not understand why specific hardware lines are 60% higher. Industry context matters. Share TrendForce data, Counterpoint reports, and manufacturer earnings. Build the case that this is an industry-wide structural shift, not your team mismanaging procurement.

3. Move to longer planning horizons

Quarterly procurement planning is too short in this market. We have moved to 18-month rolling plans with quarterly updates. A longer horizon makes room for volume commitments and vendor negotiations.

4. Diversify supplier relationships

If you have a single primary supplier, qualify a secondary now. Allocation issues at one vendor can then be worked around by switching to another. The qualification work is substantial but worth doing before you need it.

5. Build internal awareness

Engineering teams that consume infrastructure resources should understand cost dynamics. Memory and storage requests that used to be "cheap" line items now warrant scrutiny. Right-sizing conversations are healthy.

6. Plan for supply uncertainty

For critical workloads, redundancy strategy may need to include "supplier failure" alongside "hardware failure." If your single vendor has allocation issues, can you still operate?

We have not had to invoke supplier-failure scenarios yet. Having the plan in place still matters.

Closing notes

The AI infrastructure buildout is transferring real cost to every infrastructure team operating in the same memory and storage markets. This is not theoretical — it is showing up in current quarter procurement quotes.

The structural nature of the shift means it will not resolve quickly. Memory and storage fabs take years to build. AI demand shows no signs of slowing. Manufacturers are choosing margin over volume, which is rational behavior but does not relieve buyer pressure.

For infrastructure teams, the response is operational: longer planning horizons, supplier diversification, inventory buffers, scenario planning, and explicit conversations with finance teams about the new cost reality. The teams that adapt their procurement approach navigate the market reasonably well. The teams that try to operate with 2024 assumptions hit budget and project delivery issues.

Future articles will cover the specific procurement contract structures that have worked, vendor negotiation patterns in tight markets, and the capacity planning models we use for mixed AI and traditional workloads. Subscribe to follow along.

Notes on procurement strategy in the current memory and storage market. Pricing data points reflect public reporting from TrendForce, Counterpoint Research, IDC, and manufacturer disclosures through Q2 2026. Your specific procurement experience will vary by region, volume, and vendor relationships. This is operator perspective on managing infrastructure budgets in a structurally shifted market, not financial advice.