Speaker: Adit Modi @ AWS Amarathon 2025
Summary by Amazon Nova
Introduction to P5.4xlarge Instance:
AWS introduces P5.4xlarge, enabling the hosting of powerful open-source LLMs (like Qwen3-32B, Mistral, and Llama 2) on a single NVIDIA H100 GPU.
Addresses the previous need for overprovisioning large, expensive multi-GPU clusters.
Target Audience:
Engineers, startups, and individual tinkerers who want to host OSS LLMs without high costs or performance compromises.
Teams looking to scale smart rather than scale up with massive clusters.
Key Benefits:
Cost-Efficiency: Reduces the need for overprovisioning and managing complex multi-GPU clusters.
Performance: Offers no compromises on memory or compute power with a single H100 GPU.
Accessibility: Makes powerful GenAI capabilities available to more builders.
Session Highlights:
Exploration of real-world benchmarks for OSS models on P5.4xlarge.
Discussion on agentic GenAI workflows and cost-saving strategies.
Practical deployment tips, from serving Hugging Face Transformers models to running low-latency chatbots.
Practical Outcomes:
Understanding which OSS models run efficiently on a single H100.
Expectations for performance on P5.4xlarge.
Strategies for designing infrastructure that matches specific needs without overprovisioning.
Conclusion:
“Most of us don’t need an 8-GPU monster to ship a useful chatbot or GenAI app, but we still end up paying for one.”
Empower small teams and individual engineers to deploy OSS LLMs effectively and cost-efficiently using AWS P5.4xlarge.
Introduce the Need:
“Imagine if, instead of paying for and managing 8 GPUs just to get access to one H100, you could pick an instance size that finally matches your workload.”
Problem:
Inefficiency and cost of overprovisioning large GPU clusters.
Announce the Breakthrough:
New Offering:
“AWS recently introduced new single-GPU P5 instance sizes: p5.4xlarge gives you the latest NVIDIA H100—no more overbuying, no waste.”
Solution:
Emphasizes the elimination of unnecessary expenses and complexity.
Highlight the Specs and Simplicity:
Instance Specifications:
“p5.4xlarge packs everything you need: 1×H100 GPU, 16 vCPUs, 256 GiB RAM, nearly 4 TB NVMe SSD, and 100 Gbps network.”
Benefits:
“All the power of a flagship GPU with simple deployment and a lower bill.”
Make the Price Contrast Real:
Cost Comparison:
“You can rent this for around $3.90 per hour in some regions, compared to paying for all 8 GPUs on bigger P5s.”
Impact:
“This significantly lowers the bar for experiments, demos, and even production for small teams.”
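As a rough illustration of that impact, here is a back-of-the-envelope cost sketch built only on the ~$3.90/hr figure quoted above; the monthly hours and usage patterns are assumptions for illustration, and actual pricing varies by region and purchase option.

```python
# Back-of-the-envelope cost sketch for a single-GPU p5.4xlarge.
# The $3.90/hr figure is the approximate on-demand rate quoted in the talk;
# actual pricing varies by region and purchase option.

HOURLY_RATE = 3.90          # USD/hr, approximate, region-dependent
HOURS_PER_MONTH = 730       # average hours in a month

monthly_always_on = HOURLY_RATE * HOURS_PER_MONTH
print(f"Single H100, 24/7:        ~${monthly_always_on:,.0f}/month")      # ~$2,847

# A demo or internal tool that only runs during business hours
# (e.g. 8 hours/day, 22 working days) costs far less:
monthly_business_hours = HOURLY_RATE * 8 * 22
print(f"Single H100, 8h weekdays: ~${monthly_business_hours:,.0f}/month")  # ~$686

# For contrast: paying for 8 GPUs when you only need one scales the
# always-on number roughly eightfold (ignoring per-instance discounts).
print(f"8-GPU overprovisioning:   ~${8 * monthly_always_on:,.0f}/month")
```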
Before and After:
Previous Scenario:
“Before, H100 meant ‘enterprise scale’ and massive upfront cost.”
Current Scenario:
“Now, with a single click, you have pro-grade capability—ideal for single-tenant APIs, agentic RAG, internal tools, and startups moving fast.”
Transformation:
Focuses on making high-performance GPU capabilities accessible and affordable.
What You Want the Audience to Take Away:
Accessibility:
“p5.4xlarge puts NVIDIA H100 power in reach for ‘regular’ use cases—no need for massive clusters.”
Efficiency:
“Simpler, cheaper, and perfectly sized for most open-source LLM serving and experimentation.”
Optimal Solution:
“If you want the best GPU for GenAI, now you can get just one—no clusters required.”
What really fits on a single H100?
Qwen3‑32B for General Chat and Strong Reasoning:
“Qwen3‑32B is a dense, 32.8B parameter model with up to 32k token context—so actual document chat, coding, even agentic use-cases work great.”
Performance:
“On a single H100, it cruises at ~1,500 tokens/sec, serving a few simultaneous chats with headroom.”
Mistral/Mixtral for Efficiency:
“Mistral 7B and Mixtral 8x7B: these are optimized for inference, even outperforming 70B dense models in some benchmarks—despite activating way fewer parameters per token.”
Performance:
“TensorRT‑LLM on H100 gives about 3x the throughput of A100. That means you get near-70B performance for a fraction of the hardware.”
Llama 2‑13B / 70B: Perspective and When Single Isn’t Enough:
“Llama 2‑13B flies on a single H100 (5,000+ tokens/sec); 70B can work, but for true at-scale serving you sometimes need multiple GPUs.”
Use Cases:
“For most chat and RAG products, 13–32B fits and flies. 70B is only a must for the highest quality or biggest models.”
“If you need ≤32B dense (or 13B active MoE), one H100 is enough—for real-time chatbots and GenAI APIs.”
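As a rough sanity check on why models up to ~32B dense parameters fit, here is a simple weight-memory estimate against the H100's 80 GB of HBM. It is a sketch only: KV cache, activations, and runtime overhead are ignored, so real headroom is smaller than these numbers suggest.

```python
# Rough weight-memory estimate: does a model's weights fit in one 80 GB H100?
# A sketch only: KV cache, activations, and runtime overhead are not counted,
# so real headroom is smaller than "fits" suggests.

H100_HBM_GB = 80

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8":      1.0,
    "int4":      0.5,
}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate GPU memory needed just for the model weights."""
    return params_billion * BYTES_PER_PARAM[precision]

for model, size_b in [("Llama 2 13B", 13), ("Qwen3-32B", 32.8), ("Llama 2 70B", 70)]:
    for precision in BYTES_PER_PARAM:
        need = weight_memory_gb(size_b, precision)
        fits = "fits" if need < H100_HBM_GB else "does NOT fit"
        print(f"{model:12s} @ {precision:9s}: ~{need:5.1f} GB of weights -> {fits}")
```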
Framing the Value:
- “Let’s look at three common GenAI architectures that work beautifully on a single H100—without distributed infrastructure or cluster headaches.”
Pattern 1 – Low-latency chat/inference API:
“You can serve real chatbots and inference APIs on just one server: load models with Hugging Face Transformers, serve with vLLM or TensorRT‑LLM, and front it with a simple API gateway.”
Benefits:
“H100’s grunt means higher concurrency and less code complexity—no sharding or parallelism tricks needed.”
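A minimal sketch of pattern 1 using vLLM's offline Python API; the model id, memory setting, and sampling parameters are illustrative assumptions rather than recommendations. A production chat API would typically run vLLM's OpenAI-compatible server behind a lightweight gateway instead.

```python
# Minimal single-GPU serving sketch with vLLM (assumes `pip install vllm`
# and a Hugging Face model id whose weights fit in one H100's 80 GB).
from vllm import LLM, SamplingParams

# Model id and settings are illustrative, not a recommendation.
llm = LLM(
    model="Qwen/Qwen3-32B",        # assumed HF repo id
    tensor_parallel_size=1,         # single H100: no sharding needed
    gpu_memory_utilization=0.90,    # leave headroom for KV cache growth
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts internally (continuous batching),
# which is where most of the single-GPU throughput comes from.
prompts = [
    "Summarize the benefits of single-GPU LLM hosting in two sentences.",
    "Write a haiku about GPUs.",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())

# For an HTTP chat API, the same model can be exposed with vLLM's
# OpenAI-compatible server, e.g.:  vllm serve Qwen/Qwen3-32B
```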
Pattern 2 – Retrieval-Augmented Generation (RAG):
“For RAG, keep your vector DB and docs off-GPU and let the H100 do the heavy LLM generation.”
Benefits:
“This keeps costs down and performance up. Modern MoE models even fit in one H100 with smart quantization—no special hardware hackery needed.”
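A compact sketch of pattern 2: embedding and vector search stay on CPU, and only the final generation step touches the H100. The libraries (sentence-transformers, FAISS), the model ids, and the toy documents are assumptions for illustration.

```python
# RAG sketch: retrieval stays off-GPU, the H100 only runs generation.
# Assumes `pip install sentence-transformers faiss-cpu vllm`; documents are toy data.
import faiss
from sentence_transformers import SentenceTransformer
from vllm import LLM, SamplingParams

docs = [
    "p5.4xlarge provides a single NVIDIA H100 GPU with 80 GB of memory.",
    "Mixtral 8x7B activates only a subset of its experts per token.",
    "vLLM uses continuous batching to raise serving throughput.",
]

# 1) Embed documents on CPU and build a small in-memory FAISS index.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# 2) Retrieve the most relevant chunks for the question (still CPU-only).
question = "What GPU does p5.4xlarge come with?"
q_vec = embedder.encode([question], normalize_embeddings=True)
_, ids = index.search(q_vec, 2)
context = "\n".join(docs[i] for i in ids[0])

# 3) Only this step touches the H100: generate an answer grounded in the context.
llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=1)  # model id is an assumption
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
result = llm.generate([prompt], SamplingParams(max_tokens=128))
print(result[0].outputs[0].text.strip())
```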
Pattern 3 – Agentic workflows:
“Want to chain actions and make agents? One H100 runs a 32B planner model, which calls tools (via HTTP, Lambda, containers, etc.).”
Benefits:
“The GPU is only busy ‘thinking’, so you can power multiple agent flows and users at once, per instance.”
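A stripped-down sketch of pattern 3: the H100-hosted planner decides which tool to call, plain Python executes the call over HTTP while the GPU idles, and the observation is fed back to the planner. The tool name, endpoint URL, and JSON convention below are hypothetical.

```python
# Agentic loop sketch: the 32B planner runs on the H100; tools run elsewhere
# (HTTP services, Lambda functions, containers). The tool name, endpoint URL,
# and JSON "action" convention are hypothetical, not a real API.
import json
import requests
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=1)  # assumed model id
sampling = SamplingParams(temperature=0.2, max_tokens=256)

SYSTEM = (
    "You are a planner. Reply with JSON only: "
    '{"action": "search_docs" | "final_answer", "input": "..."}'
)

def call_tool(action: str, tool_input: str) -> str:
    """Dispatch a tool call off-GPU; here a hypothetical internal HTTP service."""
    resp = requests.post("http://tools.internal/search_docs",  # hypothetical URL
                         json={"query": tool_input}, timeout=10)
    return resp.text

history = f"{SYSTEM}\nUser: How do I deploy our app to staging?\n"
for _ in range(4):  # small cap on planner/tool round-trips
    out = llm.generate([history], sampling)[0].outputs[0].text
    step = json.loads(out)                  # assumes the model emits valid JSON
    if step["action"] == "final_answer":
        print(step["input"])
        break
    observation = call_tool(step["action"], step["input"])
    # The GPU was idle while the tool ran; feed the result back to the planner.
    history += f"Assistant: {out}\nObservation: {observation}\n"
```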
Summary:
- “Build most GenAI products—chat, APIs, RAG, and agents—using just one H100 instance.”
Framing the Comparison:
- “Let’s translate everything so far into a simple decision: when does one H100 make sense, and when is a cluster truly warranted?”
Efficiency:
- “Studies show multi-GPU clusters rarely give perfectly linear scaling. With 8 GPUs, you often get just 75–85% of the speed you expect—inter-GPU communication slows things down, especially for real-time inference.”
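To make that concrete, a quick calculation using the ~1,500 tokens/sec single-H100 figure quoted earlier; the linear-scaling baseline and the 75-85% efficiency band are the only inputs, and both are illustrative.

```python
# What does 75-85% scaling efficiency cost you on an 8-GPU cluster?
# Uses the ~1,500 tok/s single-H100 figure quoted earlier; the linear
# baseline and the efficiency band are illustrative assumptions.

single_gpu_tps = 1500
gpus = 8
ideal = single_gpu_tps * gpus                  # 12,000 tok/s if scaling were linear

for efficiency in (0.75, 0.85):
    actual = ideal * efficiency
    per_gpu = actual / gpus
    print(f"{efficiency:.0%} efficiency: ~{actual:,.0f} tok/s total, "
          f"~{per_gpu:,.0f} tok/s per GPU (vs {single_gpu_tps} standalone)")
# At 75-85% you pay for 8 GPUs but get roughly 6 to 6.8 GPUs' worth of
# throughput -- the inter-GPU communication overhead the quote refers to.
```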
Cost and Agility Advantage:
- “With p5.4xlarge pricing around $3.90/hr, small teams can run production-grade GenAI for a few dollars, no need for huge cluster commitments.”
Single H100 Use Cases:
- “If your model is 13–32B or an MoE with ~13B params active, a single H100 delivers: dev, internal tools, early production, moderate traffic—all without cluster headaches.”
Multi-GPU Use Cases:
- “Only go multi-GPU if you need true 70B+ scale or must support huge traffic.”
Future Reference:
“If you ever push past what a single H100 can do, the H200 is on the horizon with higher throughput.”
“But for nearly everyone today, H100 is the sweet spot.”
Summarize the Key Takeaways:
- “For most open-source LLM workloads, you don’t need a giant cluster—just a single H100 p5.4xlarge tuned to your needs.”
Models That Fit:
- “We’ve seen the models that fit—Qwen3‑32B, Mistral, Llama 2‑13B, Mixtral—all run smoothly on a single H100.”
Modern Architectures:
- “We explored three modern architectures—chat APIs, RAG, agentic workflows—each simplified and scalable with one H100.”
Cost and Complexity Savings:
- “And you’re saving money and complexity, only moving to clusters for ultra-high concurrency or largest models.”