ObservabilityGuy

16× Performance Boost and 98% Cost Reduction: A Dive into the Upgraded SLS Vector Indexing Architecture

Cost and Throughput Challenges of Vector Indexing in Log Scenarios
In semantic indexing, the embedding process is the key factor that determines the semantic recall rate. Throughout the entire semantic indexing process, embedding also represents a core cost component. The cost of embedding 1 GB of data can reach several hundred CNY, while the speed is limited to about 100 KB/s. In comparison, the costs of index construction and storage are negligible. The inference efficiency of embedding models on GPUs directly determines the speed and total cost of building a semantic index.

For knowledge base scenarios, such costs may be acceptable, since knowledge bases are relatively static and infrequently updated. However, for Simple Log Service (SLS) streaming data, new data is continuously generated, which creates significant pressure on both performance and costs. At a few hundred CNY per gigabyte and a throughput of only about 100 KB/s, such performance is unsustainable for production workloads.

To improve performance and cost efficiency for large-scale applications, we conducted systematic optimizations targeting the inference bottlenecks of the embedding service. Through in-depth analysis, solution selection, and customized improvements, we achieved a 16× increase in throughput while significantly reducing resource costs per request.

Technical Challenges and Optimization Strategies
To achieve optimal cost-efficiency of the embedding service, we need to address the following key challenges:

  1. Inference framework:

Multiple inference frameworks exist on the market, such as vLLM, SGLang, llama.cpp, TensorRT, and sentence-transformers, each with a different focus: there are general-purpose and specialized frameworks, as well as CPU-oriented and GPU-oriented ones. It is crucial to select a framework that best fits embedding workloads and maximizes hardware performance, especially on GPUs.
A framework's intrinsic computational efficiency in areas such as continuous batching and kernel optimization can also become the inference performance bottleneck for embedding models.

  2. Maximizing GPU utilization: This is the core of cost reduction. GPU resources are expensive, and leaving them underutilized is wasteful. This is quite different from how programs operated in the CPU era.

Batch processing: Embedding inference is highly sensitive to the batch size, and processing a single request is far less efficient than processing a batch, so an efficient request batching mechanism is essential (see the batching sketch after this list).
Parallel processing: CPU preprocessing (such as tokenization), network I/O, and GPU computation must be fully decoupled and parallelized to prevent GPU idle time.
Multiple model replicas: Unlike large chat models with massive parameters, typical embedding models have fewer parameters. A single replica on an A10 GPU may use only 15% of computing power and 13% of GPU memory. Efficiently deploying multiple model replicas on a single GPU to "use up" the GPU resources is crucial for reducing costs and improving throughput.

  3. Priority-based scheduling:

Semantic indexing involves two stages: index construction (large batches, low priority) and online query (small batches, strict latency requirements). Embedding tasks for query requests must not be blocked by construction tasks, which calls for a fine-grained priority queue scheduling mechanism; simple resource pool isolation is insufficient.

  4. Bottlenecks in the E2E pipeline:

After GPU utilization improves, other parts of the pipeline, such as tokenization, may become new bottlenecks.
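
To make the batching requirement concrete, the sketch below shows one way dynamic micro-batching can be implemented: requests accumulate until either a size limit or a delay limit is hit, so the GPU only ever sees batches. This is a conceptual illustration rather than SLS code (the embed_batch callable is a hypothetical stand-in for the actual GPU inference call); the production system relies on Triton's built-in dynamic batcher, described later.

```python
import asyncio

MAX_BATCH_SIZE = 32          # largest batch handed to the GPU at once
MAX_DELAY_S = 0.01           # longest time a request may wait for batch-mates

request_queue: asyncio.Queue = asyncio.Queue()

async def embed_one(text: str) -> list:
    """Called per request: enqueue the text and wait for its embedding."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((text, fut))
    return await fut

async def batching_loop(embed_batch):
    """Group queued requests into batches and run one GPU call per batch."""
    loop = asyncio.get_running_loop()
    while True:
        text, fut = await request_queue.get()        # wait for the first request
        texts, futs = [text], [fut]
        deadline = loop.time() + MAX_DELAY_S
        while len(texts) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                text, fut = await asyncio.wait_for(request_queue.get(), remaining)
                texts.append(text)
                futs.append(fut)
            except asyncio.TimeoutError:
                break
        for fut, emb in zip(futs, await embed_batch(texts)):
            fut.set_result(emb)
```
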
Solution
We eventually implemented the following optimization solution.

Optimization 1: Selecting vLLM as the Core Inference Engine to Replace llama.cpp
● Our initial choice of llama.cpp was mainly based on its high performance in C++, CPU friendliness (some of our tasks run on CPU nodes), and ease of integration. However, recent test results showed that under the same hardware conditions, the throughput of vLLM or SGLang was twice that of llama.cpp, while the average GPU utilization was 60% lower. We believe the key difference lies in vLLM's Continuous Batching mechanism and its highly optimized CUDA kernels.

● We eventually separated the embedding module as an independent service and deployed it on Elastic Algorithm Service (EAS) of Platform for AI (PAI). This way, both vector construction and query operations obtain embeddings through remote calls. Although this introduces network overhead and additional O&M costs, it delivers a significant baseline performance boost and lays a solid foundation for subsequent optimization.
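
For context, the snippet below sketches what producing embeddings directly with the vLLM engine looks like; in our deployment, the same engine runs behind the remote EAS service instead of in-process. The model name is illustrative, and the task="embed" / embed() pooling API follows recent vLLM versions, so details may vary between releases.

```python
from vllm import LLM

# Illustrative embedding model; load it with vLLM's pooling/embedding task.
llm = LLM(model="BAAI/bge-m3", task="embed")

outputs = llm.embed([
    "status=500 upstream timeout while calling order-service",
    "user login failed: invalid token",
])
vectors = [o.outputs.embedding for o in outputs]   # one dense vector per input text
print(len(vectors), len(vectors[0]))
```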

Optimization 2: Deploying Multiple Model Replicas on a Single GPU
● To improve GPU utilization, we needed to deploy multiple model replicas on a single A10 GPU. After evaluating several solutions, we finally chose Triton Inference Server as the service framework. It allows us to easily control the number of model replicas on a single GPU and take advantage of its scheduling and dynamic batching capabilities to route requests to different replicas. In addition, we decided to bypass the vLLM HTTP Server and invoke the vLLM core library (LLMEngine) directly in Triton's Python Backend, which removes a layer of HTTP overhead.
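
The sketch below shows the shape of this setup, assuming a Triton Python Backend model.py that wraps vLLM. For brevity it uses the high-level LLM class and illustrative tensor names ("TEXT", "EMBEDDING"); the production code drives LLMEngine directly and handles batching and async details. The replica count itself is set via instance_group in the model's config.pbtxt.

```python
import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM

class TritonPythonModel:
    def initialize(self, args):
        # One vLLM engine per Triton model instance. The instance_group count in
        # config.pbtxt decides how many replicas share one GPU; a small
        # gpu_memory_utilization leaves room for the other replicas.
        self.llm = LLM(model="/models/bge-m3", task="embed",
                       gpu_memory_utilization=0.15)

    def execute(self, requests):
        responses = []
        for request in requests:
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            prompts = [t.decode("utf-8") for t in texts.reshape(-1)]
            outputs = self.llm.embed(prompts)
            embeddings = np.asarray([o.outputs.embedding for o in outputs],
                                    dtype=np.float32)
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("EMBEDDING", embeddings)]))
        return responses
```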

Optimization 3: Decoupling Tokenization from Model Inference
● We observed that with multiple vLLM replicas, the tokenization stage became a new performance bottleneck after the GPU throughput was improved. Our tests also showed that the tokenization throughput of llama.cpp was six times higher than that of vLLM. Therefore, we decoupled the tokenization and inference stages. We used llama.cpp for high-performance tokenization, and then passed token IDs to vLLM for inference. This effectively bypassed the tokenization bottleneck of vLLM and further improved the E2E throughput. After we implemented this optimization, we noticed that Snowflake published an article describing a similar approach, indicating that this is a common issue. We are also actively working with the vLLM community to help address this problem.
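
A minimal sketch of the decoupled flow is shown below, assuming the llama-cpp-python binding for tokenization and a vLLM version that accepts pre-tokenized prompts (prompt_token_ids). Paths and model names are illustrative, and the GGUF tokenizer must of course match the model that vLLM serves.

```python
from llama_cpp import Llama   # llama-cpp-python binding
from vllm import LLM

# Load only the vocabulary on the CPU side; no inference weights are needed.
tokenizer = Llama(model_path="/models/bge-m3.gguf", vocab_only=True)
embedder = LLM(model="/models/bge-m3", task="embed")

def embed(texts):
    # CPU stage: high-throughput tokenization via llama.cpp
    token_batches = [tokenizer.tokenize(t.encode("utf-8")) for t in texts]
    # GPU stage: hand pre-computed token IDs to vLLM, skipping its tokenizer
    outputs = embedder.embed([{"prompt_token_ids": ids} for ids in token_batches])
    return [o.outputs.embedding for o in outputs]
```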

Optimization 4: Priority Queuing and Dynamic Batching
● Triton Inference Server has built-in priority queuing and dynamic batching mechanisms, which align well with the requirements of the embedding service. Embedding requests issued by query operations are assigned a higher priority to reduce query latency, while dynamic batching groups incoming requests into batches to improve overall throughput.
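
The sketch below illustrates how query-time and build-time requests could be submitted at different priorities from the client side. The endpoint, model, and tensor names are illustrative assumptions, and it presumes the model's config.pbtxt enables dynamic_batching with priority_levels.

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("embedding-service:8001")

def embed(texts, priority):
    data = np.array([t.encode("utf-8") for t in texts], dtype=np.object_).reshape(-1, 1)
    inp = grpcclient.InferInput("TEXT", data.shape, "BYTES")
    inp.set_data_from_numpy(data)
    out = grpcclient.InferRequestedOutput("EMBEDDING")
    # In Triton, a smaller non-zero priority value means higher priority.
    result = client.infer("embedding", inputs=[inp], outputs=[out], priority=priority)
    return result.as_numpy("EMBEDDING")

query_vectors = embed(["timeout connecting to user-db"], priority=1)  # online query: high priority
index_vectors = embed(["log chunk 1", "log chunk 2"], priority=2)     # index build: low priority
```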

Final Architecture Design
After addressing the performance bottlenecks of embedding, it was also necessary to refactor the overall semantic indexing architecture. The system needed to switch to calling remote embedding services and enable full asynchronization and parallelization across the data reading, chunking, embedding request, and result processing/storage steps.

Embedding Calls
In the previous architecture, the in-process llama.cpp engine was invoked directly for embedding. In the new architecture, embedding is performed through remote calls.

Full Asynchronization and Parallelization
In the old architecture, the steps from data parsing to chunking and embedding ran fully sequentially, which prevented the GPU-based embedding service from reaching full load. Therefore, we designed a new architecture that implements full asynchronization and parallelization, efficiently utilizing network I/O, CPU, and GPU resources.

  1. Pipeline Task Orchestration: We divided the semantic index construction process into multiple tasks and organized them into a directed acyclic graph (DAG) for execution. Different tasks can run asynchronously and in parallel, and parallel execution is also supported within a single task (a simplified sketch follows this list). Overall process:

DeserializeDataTask → ChunkingTask (parallel) → GenerateBatchTask → EmbeddingTask (parallel) → CollectEmbeddingResultTask → BuildIndexTask → SerializeTask → FinishTask

  2. Pipeline Scheduling Framework: To efficiently execute pipeline tasks, we also implemented a data- and event-driven scheduling framework.

  3. Fully Redesigned Construction Process: Through a complete rework of the code, we achieved a major architectural leap, enabling high-performance semantic index construction.
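
The toy pipeline below sketches the asynchronization idea under simplified assumptions: stages are connected by bounded queues so that chunking (CPU), remote embedding calls (network/GPU), and index building overlap instead of running one after another. split_into_chunks, call_remote_embedding, and add_to_index are hypothetical stand-ins for the real task implementations, and the actual system uses the data- and event-driven DAG scheduler described above rather than this loop.

```python
import asyncio

async def run_pipeline(docs, split_into_chunks, call_remote_embedding, add_to_index,
                       embed_concurrency=8, batch_size=32):
    # Bounded queues connect the stages so no stage can run far ahead of the others.
    chunk_q, result_q = asyncio.Queue(maxsize=256), asyncio.Queue(maxsize=256)

    async def chunking_task():                       # DeserializeData + Chunking + GenerateBatch
        batch = []
        for doc in docs:
            for chunk in split_into_chunks(doc):     # CPU-bound work
                batch.append(chunk)
                if len(batch) == batch_size:
                    await chunk_q.put(batch)
                    batch = []
        if batch:
            await chunk_q.put(batch)
        for _ in range(embed_concurrency):           # one stop signal per embedding worker
            await chunk_q.put(None)

    async def embedding_task():                      # Embedding (parallel remote calls)
        while (batch := await chunk_q.get()) is not None:
            await result_q.put((batch, await call_remote_embedding(batch)))
        await result_q.put(None)

    async def build_task():                          # CollectEmbeddingResult + BuildIndex
        finished_workers = 0
        while finished_workers < embed_concurrency:
            item = await result_q.get()
            if item is None:
                finished_workers += 1
                continue
            chunks, vectors = item
            add_to_index(chunks, vectors)

    await asyncio.gather(chunking_task(),
                         *[embedding_task() for _ in range(embed_concurrency)],
                         build_task())
```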

Conclusion: Higher Throughput and Cost Efficiency
After the full pipeline transformation, tests showed the following results:

● The throughput increased from 170 KB/s to 3 MB/s.
● The SLS vector indexing service is priced at CNY 0.01 per million tokens, offering a cost advantage of two orders of magnitude compared with industry alternatives.
You are welcome to use this service. For more information, see the usage guide.
