When developing LLM-based apps, we often run into a constraint where an Ollama instance serving an LLM model processes only one request at a time, holding the remaining requests in a queue. While horizontally scaling the physical GPU servers that host Ollama + LLM model + app may be the optimal solution for production environments, it is often not feasible during a product PoC because of infrastructure costs.
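To see this queueing behaviour for yourself, you can fire a handful of simultaneous requests at a single Ollama instance and watch how latency grows with queue position. The following is a minimal sketch, assuming Ollama listens on http://localhost:11434 and a model named "llama3" has already been pulled; adjust the URL, model name, and request count to your setup.

```python
# Minimal concurrency check against a single Ollama instance (stdlib only).
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"   # default Ollama endpoint
MODEL = "llama3"                              # assumed model name
N_REQUESTS = 8                                # simultaneous requests to fire

def one_request(i: int) -> float:
    """Send one non-streamed generate request and return its latency in seconds."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": f"Request {i}: reply with one short sentence.",
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # wait for the full response
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
        latencies = list(pool.map(one_request, range(N_REQUESTS)))
    # If requests are served one at a time, latency grows roughly linearly
    # with each request's position in the queue.
    for i, sec in enumerate(latencies):
        print(f"request {i}: {sec:.1f}s")
```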
To overcome this, it's crucial to run a concurrency check like the one above for the app and its LLM model, and measure how many requests a single GPU server can accommodate. Fortunately, there is a way to deploy multiple instances of Ollama + LLM model + app on a single GPU server, and this setup can later be combined with horizontal scaling of physical servers for production environments.
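One way to realise this on a single server is to start several Ollama instances, each bound to its own port via the OLLAMA_HOST environment variable, and have the app spread requests across them. Below is a rough sketch, not a production implementation: the port list, model name, and round-robin dispatch are illustrative assumptions, and each instance loads its own copy of the model, so the GPU must have enough memory for all of them.

```python
# Sketch: multiple Ollama instances on one GPU server, with round-robin dispatch.
# Assumptions: the "ollama" binary is on PATH, the ports below are free, and
# the GPU can hold one copy of the model per instance.
import itertools
import json
import os
import subprocess
import urllib.request

PORTS = [11434, 11435, 11436]   # one Ollama instance per port (illustrative)
MODEL = "llama3"                # assumed model name

def start_instances() -> list[subprocess.Popen]:
    """Launch one 'ollama serve' process per port."""
    procs = []
    for port in PORTS:
        env = dict(os.environ, OLLAMA_HOST=f"127.0.0.1:{port}")
        procs.append(subprocess.Popen(["ollama", "serve"], env=env))
    return procs

_next_port = itertools.cycle(PORTS)

def generate(prompt: str) -> str:
    """Send one request to the next instance in round-robin order."""
    port = next(_next_port)
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In practice the round-robin step would typically live in a reverse proxy or load balancer in front of the instances rather than in application code, but the idea is the same: the concurrency check tells you how many instances one GPU server can sustain, and horizontal scaling of servers can be added on top for production.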
If you're facing similar constraints during your product PoC, consider this deployment architecture to optimize your infrastructure costs.

#LLMModel #GPU #DeploymentArchitecture #ProductPoC