Hey everyone,
I'm currently facing some challenges with optimizing the response time of my AWS instance. Here's the setup: I'm using a g5.xlarge instance, which has a single NVIDIA A10G GPU with 24 GB of VRAM. I recently fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on my own data, merged the fine-tuned weights back into the base model, and applied quantization to reduce the memory footprint.
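For context, my loading code looks roughly like this. It's a simplified sketch: the model path is a placeholder, and the bitsandbytes 4-bit (NF4) config is just an example standing in for the exact quantization setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "./mistral-7b-instruct-merged"  # placeholder path to the merged model

# Example quantization config (4-bit NF4 via bitsandbytes); the real one may differ
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit the 24 GB A10G comfortably
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers
)
```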
However, when I send a request to the fine-tuned model, it takes roughly 3 minutes to respond, even with generation capped at 1024 new tokens. I'm looking for suggestions on how to reduce this latency.
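A single request looks roughly like this (the prompt is illustrative; `model` and `tokenizer` come from the loading snippet above):

```python
import torch

prompt = "[INST] Summarize the following text... [/INST]"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # keep inputs on the GPU

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,  # the 1024-token cap mentioned above
        do_sample=False,
        use_cache=True,       # KV cache; generation is much slower without it
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```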
I've also run into errors when trying to handle multiple requests simultaneously. Specifically, I've seen errors like the ones below (a simplified sketch of my serving code follows the list):
- "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"
- "The SW shall provide an estimated value for the torque
- "CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions."
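Here's a stripped-down sketch of the serving side. FastAPI is only an illustration here, not necessarily the framework I actually use, but concurrent requests hit `model.generate` the same way (`model` and `tokenizer` are loaded as in the first snippet):

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # Concurrent requests each run this handler, so model.generate
    # can be called from several threads at once.
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=1024)
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```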
Could someone please guide me on how to fix these errors and handle multiple concurrent requests efficiently on this instance?
Any help or advice would be greatly appreciated. Thanks in advance!