DEV Community

Cover image for Day 0 Benchmark: Deploying DeepSeek-V4-Flash-DSpark on GPUStack Doubles Throughput
GPUStack
GPUStack

Posted on

Day 0 Benchmark: Deploying DeepSeek-V4-Flash-DSpark on GPUStack Doubles Throughput

This article is based on a community benchmark contributed by a GPUStack user. DeepSeek-V4-Flash-DSpark enhances DeepSeek-V4-Flash by adding a Speculative Decoding module. Using the same model weights with an additional speculative decoder, it significantly improves both inference throughput and Time to First Token (TTFT).

On Day 0 of the model release, a GPUStack community member deployed and benchmarked DeepSeek-V4-Flash-DSpark on an 8× H20-141G setup, comparing it against the original DeepSeek-V4-Flash (DSV4F) under identical deployment settings. Here are the key results:

  • Single-stream throughput: In the 1K input / 1K output workload, DSpark achieved 195 tokens/s, approximately the throughput of the original model (96 tokens/s);
  • Overall throughput: In the 64K input / 3K output workload with 10 concurrent requests, DSpark reached 338 tokens/s, approximately 1.7× higher than the original (198 tokens/s);
  • Time to First Token (TTFT): Reduced to roughly half that of the original model.

The following sections walk through the complete deployment process and benchmark results.


1. Deploying DSpark on GPUStack

GPUStack comes with the SGLang inference backend built in. To deploy DSpark, simply add a container image that supports the model. The entire process can be completed in just a few clicks through the web UI.

① Go to Inference Backends and edit SGLang.

From the left navigation menu, go to Inference Backends, locate the SGLang card, then click the menu in the upper-right corner and select Edit.

② Add a DSpark-compatible image

Under Version Configuration, click Add Version to create a new version named dspark, then specify the following container image:

swr.cn-north-4.myhuaweicloud.com/desaysv/gpustack/sglang-dspark:v1.0
Enter fullscreen mode Exit fullscreen mode

Select CUDA as the framework. Set the image entrypoint override to sglang serve, and set the command to: --model-path {{model_path}} --host {{worker_ip}} --port {{port}}

③ Create a Deployment

Return to the Deployments page and click Deploy Model in the upper-right corner. For Source, select ModelScope.

④ Select the Model and Inference Backend

  • Search for and select deepseek-ai/DeepSeek-V4-Flash-DSpark.
  • Set the Inference Backend to SGLang.
  • Select dspark-custom as the backend version.

⑤ Configure Backend Parameters

Under Advanced, configure the backend parameters as follows (using an 8× H20-141G setup as an example):

--context-length 1000000
--trust-remote-code
--tp-size 8
--ep-size 8
--moe-runner-backend flashinfer_mxfp4
--speculative-moe-runner-backend flashinfer_mxfp4
--speculative-algorithm DSPARK
--speculative-eagle-topk 1
--speculative-num-steps 1
--mem-fraction-static 0.85
--cuda-graph-max-bs 32
--max-running-requests 32
--disable-overlap-schedule
Enter fullscreen mode Exit fullscreen mode

⑥ Configure Environment Variables

Add the following environment variable to ensure the required dependencies are installed correctly:

Key Value
PYPI_PACKAGES_INSTALL -U distro -i https://mirrors.aliyun.com/pypi/simple/

⑦ Start the Deployment and Monitor the Logs

Once submitted, GPUStack will start the deployment. In the logs, you should see CUDA Graph capture, Application startup complete, and Uvicorn listening on the inference port, indicating that the model has started successfully:

⑧ Verify the Deployment Status

Once the instance reaches the Running state, the deployment is complete.

⑨ Verify the Deployment

Open the Playground and send a few prompts to the model. The real-time throughput indicator in the lower-right corner should show an output rate of 185.94 tokens/s, with single-stream throughput remaining stable at around 200 tokens/s.

⑩ View the Inference Service Port

If you want to run benchmarks directly against the service, open the instance details to find the inference service IP address and port (in this example, 10.91.3.213:40048).


2. Benchmark Results: DSpark vs. DSV4F

Under identical hardware and deployment settings, we benchmarked the original DeepSeek-V4-Flash (with MTP enabled) against DSpark across two workloads. All benchmarks were conducted using SGLang's built-in bench_serving tool.

Scenario 1: 1K Input / 1K Output (Single Request)

HF_ENDPOINT=https://hf-mirror.com python3 -m sglang.bench_serving \
    --backend sglang --port 40048 \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 \
    --random-range-ratio 1.0 --num-prompts 1 \
    --max-concurrency 1 --request-rate inf --host <Inference Server IP Address>
Enter fullscreen mode Exit fullscreen mode

Original DSV4F: Output throughput: 96.20 tokens/s, TTFT: 300.45 ms, Accept length: 2.71

DSpark(DSV4FD): Output throughput 195.18 tok/s,TTFT 129.34 ms,Accept length 4.42

Single-stream throughput reached 195 tokens/s, approximately that of the original DSV4F (96 tokens/s), while Time to First Token (TTFT) was reduced to about half.

Scenario 2: 64K Input / 3K Output (10 Concurrent Requests)

HF_ENDPOINT=https://hf-mirror.com python3 -m sglang.bench_serving \
    --backend sglang --port 40048 \
    --dataset-name random --random-input-len 64000 --random-output-len 3000 \
    --random-range-ratio 1.0 --num-prompts 10 \
    --max-concurrency 1 --request-rate inf --host <Inference Server IP Address>
Enter fullscreen mode Exit fullscreen mode

Original DSV4F (MTP enabled): Output throughput: 198.60 tokens/s, Speculative acceptance rate: 20.91%, Acceptance length: 1.21

DSpark(DSV4FD): Output throughput 338.17 tok/s,Accept length 4.90

In the long-context workload, DSpark achieved 338 tokens/s, approximately 1.7× the throughput of the original DSV4F (198 tokens/s), nearly doubling overall throughput.

Summary

Workload Metric Original DSV4F DSpark (DSV4FD) Improvement
1K Input / 1K Output Output Throughput (tokens/s) 96.20 195.18 ≈ 2.0×
1K Input / 1K Output TTFT (ms) 300.45 129.34 ≈ 0.43×
1K Input / 1K Output Acceptance Length 2.71 4.42
64K Input / 3K Output Output Throughput (tokens/s) 198.60 338.17 ≈ 1.7×
64K Input / 3K Output Acceptance Length 1.21 4.90

Conclusion

Across both single-request and long-context workloads, DeepSeek-V4-Flash-DSpark consistently delivered around 2× the throughput of the original DSV4F, while reducing Time to First Token (TTFT) to roughly half. Achieving these gains requires nothing more than switching to the DSpark model weights and container image with the integrated speculative decoding module.

With GPUStack's built-in SGLang inference backend, the entire deployment can be completed in just a few clicks through the web UI, making the model production-ready on Day 0.

Benchmark environment: 8× H20-141G | GPUStack v2 | SGLang 0.5.14 (using the dspark-custom image based on a patched SGLang 0.5.14)

Acknowledgments: Thanks to the GPUStack community member for sharing the benchmark results.

Top comments (0)