
Dev Yadav

Posted on • Originally published at luminoai.co.in

The Demo Worked on a 7B Model. Production Traffic Changed the Math.

The demo looked fine on a small model with one user. Then real traffic showed up, latency got ugly, and the original GPU choice stopped making sense.

Why this happens

  • demos are usually tested with tiny load and perfect patience
  • production adds concurrency, queueing, and impatience
  • a model that feels cheap at one request at a time can become painful under real usage
  • people optimize for "it works" instead of "it responds fast enough"
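The jump from "feels fine with one user" to "painful under load" is basic queueing behavior. A toy M/M/1 model makes it concrete: as the arrival rate approaches the GPU's service rate, average latency does not degrade linearly, it blows up. The numbers below are illustrative placeholders, not benchmarks of any real GPU or model.

```python
# Toy M/M/1 queueing model: average latency as load approaches capacity.
# service_time_s and the traffic levels are made-up illustrative numbers.

def avg_latency(service_time_s: float, requests_per_s: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    mu = 1.0 / service_time_s          # service rate (requests/s)
    if requests_per_s >= mu:
        return float("inf")            # past saturation: the queue grows without bound
    return 1.0 / (mu - requests_per_s)

service = 0.5  # suppose one request takes 0.5 s of GPU time on its own
for rps in (0.2, 1.0, 1.8, 1.95):
    print(f"{rps:.2f} req/s -> {avg_latency(service, rps):.2f} s avg latency")
# 0.20 req/s -> 0.56 s avg latency
# 1.00 req/s -> 1.00 s avg latency
# 1.80 req/s -> 5.00 s avg latency
# 1.95 req/s -> 20.00 s avg latency
```

Note the last two lines: going from 90% to 97.5% of capacity quadruples latency. That is why a card that handled the demo comfortably can fall over the moment real concurrency shows up.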

What people get wrong

They think the model choice was wrong. Sometimes the model is fine. The real issue is that the compute plan was sized for a demo, not for production behavior.

Practical rule

  • start with RTX 4090 for small models and light traffic
  • move to A100 80GB when latency and concurrency become the real problem
  • only evaluate H100 when the workload is already clearly heavy
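The rule of thumb above can be sketched as a tiny decision function. The thresholds here (13B params, 5 and 50 requests/s) are placeholder assumptions I've picked for illustration, not vendor guidance; plug in your own measured numbers.

```python
# Illustrative GPU-tier picker mirroring the rule of thumb above.
# The size and traffic thresholds are assumed placeholders, not benchmarks.

def pick_gpu(model_params_b: float, peak_rps: float) -> str:
    """Suggest a starting GPU tier from model size (billions of params) and peak traffic."""
    if model_params_b <= 13 and peak_rps < 5:
        return "RTX 4090"    # small model, light traffic
    if model_params_b <= 70 or peak_rps < 50:
        return "A100 80GB"   # latency and concurrency are now the real problem
    return "H100"            # workload is already clearly heavy

print(pick_gpu(7, 1))     # demo-scale 7B model -> RTX 4090
print(pick_gpu(13, 20))   # same size class, real concurrency -> A100 80GB
print(pick_gpu(180, 100)) # large model, heavy traffic -> H100
```

The point of writing it down as code is that the inputs are traffic numbers, not vibes: if you can't name your peak requests per second, you're still sizing for the demo.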

The simple takeaway

If the demo worked and production did not, the lesson is not always "change the model."

Sometimes the model is fine and the GPU plan is still stuck in demo mode.

