How We Reduced LLM Latency by 89% and Token Usage by 91% in a Production Chrome Extension

#ai #architecture #llm #performance

Introduction

When building Simmark, our AI-powered bookmark organizer, our primary goal was to eliminate user friction. Unlike other tools, Simmark does not ask users to generate and enter their own API keys. Our backend handles the LLM integration directly.

However, our initial implementation was largely unoptimized. Processing 200 bookmarks took 62.74 seconds on average, which was far too slow for a seamless user experience.

The Architecture Optimization

We went through five backend iterations to stabilize the AI processing pipeline. Here are the two core structural changes that resolved our bottlenecks.

1. Flattening the Request/Response Payloads

Initially, we sent the user's bookmarks to the LLM as a nested JSON tree. This caused severe context parsing issues for the model, leading to missing brackets, JSON format violations, and occasional looping.

By converting the hierarchical tree into a flat array before inserting it into the prompt, we reduced the structural complexity the model had to handle. We also required the LLM to return its output as a flat structure. Removing the nested hierarchy eliminated the parsing errors and drastically reduced unnecessary token consumption.
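As a rough illustration of the flattening step, here is a minimal Python sketch. It is not our production code: the function name `flatten_bookmarks` is hypothetical, and the field names (`id`, `title`, `url`, `children`) are assumed to mirror the node shape returned by the Chrome bookmarks API.

```python
def flatten_bookmarks(nodes):
    """Turn a nested bookmark tree into a flat list of leaf bookmarks.

    Assumption: folders carry a "children" list and no "url";
    leaf bookmarks carry "id", "title", and "url".
    """
    flat = []
    for node in nodes:
        if "children" in node:
            # Folder: descend into it, but do not emit the folder itself.
            flat.extend(flatten_bookmarks(node["children"]))
        else:
            # Leaf: keep only the fields the prompt needs.
            flat.append(
                {"id": node["id"], "title": node["title"], "url": node["url"]}
            )
    return flat


tree = [
    {
        "id": "1",
        "title": "Dev",
        "children": [
            {"id": "2", "title": "MDN", "url": "https://developer.mozilla.org"},
            {
                "id": "3",
                "title": "Nested",
                "children": [
                    {"id": "4", "title": "Python", "url": "https://python.org"}
                ],
            },
        ],
    },
    {"id": "5", "title": "News", "url": "https://example.com"},
]

flat = flatten_bookmarks(tree)
```

The resulting list has no brackets to balance beyond a single array, which is the property that made the model's output easier to keep well-formed.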

2. Delegating Deterministic Logic to the Application Layer

In our early versions, we relied on the LLM to sort items by view count and filter out duplicate IDs. We realized that offloading deterministic tasks to a probabilistic model is inefficient.

We moved the sorting logic and duplicate removal entirely into our backend application layer. The backend now receives the flat JSON response from the LLM, recovers any bookmark IDs the model omitted (a common failure in LLM output), removes duplicates, and reconstructs the final tree structure. Let the AI handle the categorization; let the application code handle the exact sorting.
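The post-processing described above could look roughly like the following Python sketch. Everything here is illustrative: the function name `rebuild_tree`, the `category` and `views` fields, and the `"Uncategorized"` fallback folder are assumptions, as is the choice to ignore IDs the model returns that were never sent to it.

```python
def rebuild_tree(bookmarks, llm_items, fallback="Uncategorized"):
    """Rebuild a category tree from a flat LLM response.

    bookmarks: flat list of dicts with "id" and an optional "views" count.
    llm_items: flat list of {"id": ..., "category": ...} from the model.
    """
    by_id = {b["id"]: b for b in bookmarks}

    # Remove duplicates (first assignment wins) and ignore unknown IDs.
    assigned = {}
    for item in llm_items:
        bid = item.get("id")
        if bid in by_id and bid not in assigned:
            assigned[bid] = item.get("category") or fallback

    # Recover IDs the model omitted by placing them in the fallback folder.
    for bid in by_id:
        assigned.setdefault(bid, fallback)

    # Group bookmarks by category, preserving first-seen category order.
    groups = {}
    for bid, category in assigned.items():
        groups.setdefault(category, []).append(by_id[bid])

    # Deterministic sort by view count, highest first.
    return [
        {
            "title": category,
            "children": sorted(
                items, key=lambda b: b.get("views", 0), reverse=True
            ),
        }
        for category, items in groups.items()
    ]


bookmarks = [
    {"id": "1", "title": "A", "url": "https://a.example", "views": 3},
    {"id": "2", "title": "B", "url": "https://b.example", "views": 10},
    {"id": "3", "title": "C", "url": "https://c.example", "views": 1},
]
llm_items = [
    {"id": "1", "category": "Dev"},
    {"id": "2", "category": "Dev"},
    {"id": "1", "category": "News"},  # duplicate ID, dropped
    {"id": "9", "category": "Dev"},   # ID never sent to the model, ignored
]
tree = rebuild_tree(bookmarks, llm_items)
```

Because every step after the model call is ordinary code, the same input always yields the same tree, and a missing or repeated ID no longer counts as a processing error.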

The Results

By restructuring the data payload and separating responsibilities between the LLM and the application backend, we achieved the following metrics in our benchmark (100 bookmarks, 30 iterations):

Average response time dropped from 62.74s to 6.78s (89% reduction).
Average output tokens decreased from 25,403 to 2,403 (91% reduction).
Processing error rate dropped from 16.7% to 0.0%.
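For readers who want to check the arithmetic, the reduction percentages follow from the before and after averages listed above. A small Python sketch (the helper name `percent_reduction` is ours for illustration):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


latency = percent_reduction(62.74, 6.78)    # average response time, seconds
tokens = percent_reduction(25_403, 2_403)   # average output tokens

print(f"latency: {latency:.1f}%  tokens: {tokens:.1f}%")
# prints "latency: 89.2%  tokens: 90.5%"
```

Rounded to whole numbers, these give the 89% and 91% figures quoted in the title.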

Try It Out

If you want to see the performance of the optimized backend pipeline, you can test the extension here:

https://chromewebstore.google.com/detail/simmark-ai-bookmark-manag/kmblaifgcnldcklbceioinenknioaaae?hl=en

It automatically groups your messy bookmarks by domain or topic through a chat interface. It works immediately without requiring any setup or API keys.

I am open to any feedback regarding backend architecture, prompt engineering, or Chrome extension development.

DEV Community
