Eliana Lam for AWS Community On Air


Run OSS LLMs on a Single H100 Smarter, Cheaper, Faster

Speaker: Adit Modi @ AWS Amarathon 2025

Summary by Amazon Nova



Introduction to P5.4xlarge Instance:

  • AWS introduces p5.4xlarge, enabling powerful open-source LLMs (such as Qwen3‑32B, Mistral, and Llama 2) to be hosted on a single NVIDIA H100 GPU.

  • Addresses the previous need for overprovisioning large, expensive multi-GPU clusters.

Target Audience:

  • Engineers, startups, and individual tinkerers who want to host OSS LLMs without high costs or performance compromises.

  • Teams looking to scale smart rather than scale up with massive clusters.

Key Benefits:

  • Cost-Efficiency: Reduces the need for overprovisioning and managing complex multi-GPU clusters.

  • Performance: Offers no compromises on memory or compute power with a single H100 GPU.

  • Accessibility: Makes powerful GenAI capabilities available to more builders.

Session Highlights:

  • Exploration of real-world benchmarks for OSS models on P5.4xlarge.

  • Discussion on agentic GenAI workflows and cost-saving strategies.

  • Practical deployment tips, from serving models with Hugging Face Transformers to running low-latency chatbots.

Practical Outcomes:

  • Understanding which OSS models run efficiently on a single H100.

  • Expectations for performance on P5.4xlarge.

  • Strategies for designing infrastructure that matches specific needs without overprovisioning.

Conclusion:

  • “Most of us don’t need an 8-GPU monster to ship a useful chatbot or GenAI app, but we still end up paying for one.”

  • Empower small teams and individual engineers to deploy OSS LLMs effectively and cost-efficiently using AWS P5.4xlarge.



Introduce the Need:

  • “Imagine if, instead of paying for and managing 8 GPUs just to get access to one H100, you could pick an instance size that finally matches your workload.”

  • Problem: Inefficiency and cost of overprovisioning large GPU clusters.

Announce the Breakthrough:

  • New Offering: “AWS recently introduced new single-GPU P5 instance sizes: p5.4xlarge gives you the latest NVIDIA H100—no more overbuying, no waste.”

  • Solution: Eliminates unnecessary expense and complexity.

Highlight the Specs and Simplicity:

  • Instance Specifications: “p5.4xlarge packs everything you need: 1×H100 GPU, 16 vCPUs, 256 GiB RAM, nearly 4 TB NVMe SSD, and 100 Gbps network.”

  • Benefits: “All the power of a flagship GPU with simple deployment and a lower bill.” (A minimal launch sketch follows below.)
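
For readers who want to try it, here is a minimal boto3 sketch for launching a p5.4xlarge. The AMI ID, key pair, and region are placeholders, not values from the talk; swap in a Deep Learning AMI (or your own image) for your region.

```python
# Minimal sketch: launching a p5.4xlarge with boto3.
# ImageId, KeyName, and the region are placeholders; pick a Deep Learning AMI
# (or your own image) available in your region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="p5.4xlarge",         # 1x H100, 16 vCPUs, 256 GiB RAM
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
print("Launched:", resp["Instances"][0]["InstanceId"])
```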

Make the Price Contrast Real:

  • Cost Comparison: “You can rent this for around $3.90 per hour in some regions, compared to paying for all 8 GPUs on bigger P5s.”

  • Impact: “This significantly lowers the bar for experiments, demos, and even production for small teams.” (See the back-of-envelope cost math below.)
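
To make the contrast concrete, here is a back-of-envelope calculation using only the ~$3.90/hr figure quoted above. The 8-GPU line simply multiplies that per-GPU rate by eight as a scale check; it is not actual 8-GPU P5 pricing, which varies by region and purchase option.

```python
# Back-of-envelope cost comparison using only the ~$3.90/hr figure from the talk.
# The 8-GPU line assumes the same per-GPU rate, purely for scale.
single_gpu_hourly = 3.90
hours_per_month = 730

single_gpu_monthly = single_gpu_hourly * hours_per_month        # ~2,847 USD
eight_gpu_monthly = single_gpu_hourly * 8 * hours_per_month     # ~22,776 USD

print(f"1x H100 (p5.4xlarge): ~${single_gpu_monthly:,.0f}/month")
print(f"8x H100 (same per-GPU rate): ~${eight_gpu_monthly:,.0f}/month")
```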

Before and After:

  • Previous Scenario: “Before, H100 meant ‘enterprise scale’ and massive upfront cost.”

  • Current Scenario: “Now, with a single click, you have pro-grade capability—ideal for single-tenant APIs, agentic RAG, internal tools, and startups moving fast.”

  • Transformation: Focuses on making high-performance GPU capabilities accessible and affordable.

What You Want the Audience to Take Away:

  • Accessibility: “p5.4xlarge puts NVIDIA H100 power in reach for ‘regular’ use cases—no need for massive clusters.”

  • Efficiency: “Simpler, cheaper, and perfectly sized for most open-source LLM serving and experimentation.”

  • Optimal Solution: “If you want the best GPU for GenAI, now you can get just one—no clusters required.”



What really fits on a single H100? 

Qwen3‑32B for General Chat and Strong Reasoning:

  • “Qwen3‑32B is a dense, 32.8B parameter model with up to 32k token context—so actual document chat, coding, even agentic use-cases work great.”

  • Performance: “On a single H100, it cruises at ~1,500 tokens/sec, serving a few simultaneous chats with headroom.”

Mistral/Mixtral for Efficiency:

  • “Mistral 7B and Mixtral 8x7B: these are optimized for inference, even outperforming 70B dense models in some benchmarks—despite activating far fewer parameters per token.”

  • Performance: “TensorRT‑LLM on H100 gives about 3x the throughput of A100. That means you get near-70B performance for a fraction of the hardware.”

Llama 2‑13B / 70B: Perspective, and When a Single GPU Isn’t Enough:

  • “Llama 2‑13B flies on a single H100 (5,000+ tokens/sec); 70B can work but for true at-scale, sometimes you do need multiple GPUs.”

  • Use Cases: “For most chat and RAG products, 13–32B fits and flies. 70B is only a must for the highest quality or biggest models.”

  • “If you need ≤32B dense (or 13B active MoE), one H100 is enough—for real-time chatbots and GenAI APIs.” (The back-of-envelope memory math below shows why.)
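
As a rough sanity check on “what fits,” the sketch below estimates weight memory as parameters × bytes per parameter against the H100’s 80 GB. These are rules of thumb, not benchmarks: the KV cache and activations need extra headroom, which is exactly why 13–32B dense models are comfortable and 70B typically needs quantization or more GPUs.

```python
# Rough rule of thumb for "does it fit on one 80 GB H100?":
# weights ~= params (in billions) * bytes per parameter, in GB.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # billions of params x bytes/param = GB

H100_GB = 80

for name, params_b in [("Llama 2 13B", 13), ("Qwen3-32B", 32.8), ("Llama 2 70B", 70)]:
    bf16 = weight_gb(params_b, 2)    # BF16/FP16: 2 bytes per parameter
    int4 = weight_gb(params_b, 0.5)  # 4-bit quantized: ~0.5 bytes per parameter
    print(f"{name}: ~{bf16:.0f} GB in BF16, ~{int4:.0f} GB at 4-bit "
          f"(H100 has {H100_GB} GB; leave room for the KV cache)")
```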



Framing the Value:

  • “Let’s look at three common GenAI architectures that work beautifully on a single H100—without distributed infrastructure or cluster headaches.”

Pattern 1 – Low-latency chat/inference API:

  • “You can serve real chatbots and inference APIs on just one server: load models with Hugging Face Transformers, serve with vLLM or TensorRT‑LLM, and front it with a simple API gateway.”

  • Benefits: “H100’s grunt means higher concurrency and less code complexity—no sharding or parallelism tricks needed.” (A minimal serving sketch follows below.)
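
A minimal sketch of Pattern 1, assuming vLLM and the Qwen3‑32B weights from Hugging Face. The model name, context length, and sampling settings are illustrative; a production API would typically run vLLM’s OpenAI-compatible server (e.g. `vllm serve <model>`) behind a gateway rather than the offline engine shown here.

```python
# Minimal sketch of Pattern 1: one H100, one process, one model.
# Model name and settings are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",        # ~66 GB of BF16 weights; fits on one 80 GB H100
    max_model_len=32768,           # full 32k context
    gpu_memory_utilization=0.90,   # leave some headroom for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize what a p5.4xlarge instance offers."], params)
print(outputs[0].outputs[0].text)
```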

Pattern 2 – Retrieval-Augmented Generation (RAG):

  • “For RAG, keep your vector DB and docs off-GPU, let the H100 do the heavy LLM generation.”

  • Benefits: “This keeps costs down and performance up. Modern MoE models even fit in one H100 with smart quantization—no special hardware hackery needed.” (A minimal RAG sketch follows below.)
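
Here is a minimal RAG sketch along those lines, assuming a local vLLM OpenAI-compatible endpoint on port 8000. The in-memory “retriever” is a stand-in for a real vector store (OpenSearch, pgvector, FAISS, etc.), and the document snippets are illustrative.

```python
# Sketch of Pattern 2: retrieval stays off-GPU, the H100 only does generation.
# The "vector DB" here is a trivial in-memory keyword match standing in for a
# real store; the endpoint assumes a local vLLM OpenAI-compatible server.
from openai import OpenAI

DOCS = [
    "p5.4xlarge: 1x H100, 16 vCPUs, 256 GiB RAM, ~3.8 TB NVMe, 100 Gbps network.",
    "Qwen3-32B supports up to 32k tokens of context.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Placeholder retriever: rank docs by words shared with the query.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "What hardware does p5.4xlarge include?"
context = "\n".join(retrieve(question))
resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```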

Pattern 3 – Agentic workflows:

  • “Want to chain actions and make agents? One H100 runs a 32B planner model, which calls tools (via HTTP, Lambda, containers etc).”

  • Benefits: “The GPU is only busy ‘thinking’, so you can power multiple agent flows and users at once, per instance.” (A minimal agent-loop sketch follows below.)
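
A minimal agent-loop sketch under the same assumptions as the earlier patterns (a local vLLM endpoint and an illustrative model name). The JSON action convention and the `search_docs` tool are invented for the example, standing in for real HTTP or Lambda tool calls.

```python
# Sketch of Pattern 3: the H100 only runs the planner model; tools run elsewhere.
# The JSON action format and the tool are illustrative conventions, not a protocol.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def search_docs(query: str) -> str:          # stand-in for an HTTP/Lambda tool
    return f"(pretend search results for: {query})"

TOOLS = {"search_docs": search_docs}

SYSTEM = (
    'You are a planner. Reply with JSON only: '
    '{"tool": "search_docs", "input": "..."} to call a tool, '
    'or {"final": "..."} when you can answer.'
)

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What instance sizes does P5 offer?"}]

for _ in range(5):                            # cap the loop to avoid runaway agents
    reply = client.chat.completions.create(model="Qwen/Qwen3-32B", messages=messages)
    content = reply.choices[0].message.content
    action = json.loads(content)              # sketch only: assumes well-formed JSON
    if "final" in action:
        print(action["final"])
        break
    result = TOOLS[action["tool"]](action["input"])   # dispatch to the tool
    messages += [{"role": "assistant", "content": content},
                 {"role": "user", "content": f"Tool result: {result}"}]
```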

Summary: 

  • “Build most GenAI products—chat, APIs, RAG, and agents—using just one H100 instance.”


Framing the Comparison:

  • “Let’s translate everything so far into a simple decision: when does one H100 make sense, and when is a cluster truly warranted?”

Efficiency:

  • “Studies show multi-GPU clusters rarely give perfectly linear scaling. With 8 GPUs, you often get just 75–85% of the speed you expect—inter-GPU communication slows things down, especially for real-time inference.”

Cost and Agility Advantage:

  • “With p5.4xlarge pricing around $3.90/hr, small teams can run production-grade GenAI for a few dollars, no need for huge cluster commitments.”

Single H100 Use Cases:

  • “If your model is 13–32B or an MoE with ~13B params active, a single H100 delivers: dev, internal tools, early production, moderate traffic—all without cluster headaches.”

Multi-GPU Use Cases:

  • “Only go multi-GPU if you need true 70B+ scale or must support huge traffic.”

Future Reference:

  • “If you ever push past what a single H100 can do, the H200 is on the horizon with higher throughput.”

  • “But for nearly everyone today, H100 is the sweet spot.”



Summarize the Key Takeaways:

  • “For most open-source LLM workloads, you don’t need a giant cluster—just a single H100 p5.4xlarge tuned to your needs.”

Models That Fit:

  • “We’ve seen the models that fit—Qwen3‑32B, Mistral, Llama 2‑13B, Mixtral—all run smoothly on a single H100.”

Modern Architectures:

  • “We explored three modern architectures—chat APIs, RAG, agentic workflows—each simplified and scalable with one H100.”

Cost and Complexity Savings:

  • “And you’re saving money and complexity, only moving to clusters for ultra-high concurrency or largest models.”


Team:

AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
