DEV Community

Jaydeep Biswas

Seeking advice on optimizing response time and handling multiple requests on AWS instance with NVIDIA A10G GPU

Hey everyone,

I'm currently facing some challenges optimizing the response time of my AWS instance. Here's the setup: I'm using a g5.xlarge instance, which has a single NVIDIA A10G GPU with 24GB of VRAM. I recently fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on my custom data, merged the resulting adapter with the base model, and applied quantization to optimize further.

However, when I send a request to my fine-tuned model, it takes approximately 3 minutes to respond, even with the maximum new tokens capped at 1024. I'm looking for suggestions on how to reduce this latency.
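For reference, my generation call looks roughly like the sketch below. It's simplified: model/tokenizer loading and prompt formatting are omitted, and `generate_reply` is just an illustrative name for what my serving code does:

```python
import torch

def generate_reply(model, tokenizer, prompt: str,
                   max_new_tokens: int = 1024) -> str:
    # Tokenize and move inputs to the same device as the model,
    # so no tensor is accidentally left on the CPU.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():  # skip autograd bookkeeping while decoding
        out = model.generate(**inputs,
                             max_new_tokens=max_new_tokens,
                             do_sample=False)
    # Strip the prompt tokens and decode only the newly generated part.
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```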

Furthermore, I've encountered errors when attempting to handle multiple requests simultaneously. Specifically, I've received errors like:

  1. "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"
  2. "CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Could someone please guide me on how to address these errors and efficiently handle multiple requests simultaneously on my AWS instance?
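From what I've read, a single transformers model instance isn't safe to call from multiple threads at once, which might explain the crashes under concurrent load. As a stopgap I'm considering serializing requests with a lock. A minimal sketch, where `run_generation` is a stand-in for my real generate call:

```python
import threading

_gpu_lock = threading.Lock()

def run_generation(prompt: str) -> str:
    # Placeholder for the real model.generate(...) call.
    return f"response to: {prompt}"

def handle_request(prompt: str) -> str:
    # Only one request touches the GPU at a time; the rest queue up.
    with _gpu_lock:
        return run_generation(prompt)
```

This obviously trades throughput for stability, so I'd love to hear about proper batching or a dedicated serving stack instead.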

Any help or advice would be greatly appreciated. Thanks in advance!
