How we optimized a local AI agent's web search by finding the real bottleneck.
A story of missteps, failed experiments, and eventually finding the true cause.
"It's a simple detail in retrospect, but I had an issue with my web search pipeline. Passing raw search results to the next step meant thousands of tokens of input, so I added a very small model to summarize the data first. However, because Ollama defaults to sequential execution, the three summaries ran one after another and that stage took 3x longer than it needed to. It seems obvious now, but it was the major bottleneck before I measured it! I also sped things up by disabling non-text rendering (images, CSS, etc.)."
Background
Xoul is a local AI agent that answers user queries by searching the web. The pipeline is: search → visit URLs → extract text → generate LLM response. A simple question like "How many pages is [book title]?" was either failing completely or taking 20+ seconds.
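The four stages can be sketched as one small orchestration function. This is a sketch only: `search`, `fetch`, `extract`, and `generate` are stand-ins for Xoul's real tools, not its actual API.

```python
def answer(query, search, fetch, extract, generate):
    """Run the pipeline: search -> visit URLs -> extract text -> LLM response."""
    urls = search(query)                  # web search returns candidate URLs
    pages = [fetch(u) for u in urls[:3]]  # visit the top results
    texts = [extract(p) for p in pages]   # reduce each page to plain text
    return generate(query, texts)         # final LLM answer over the extracts

# Toy run with stand-in stages:
result = answer(
    "How many pages is the book?",
    search=lambda q: ["https://example.com/a", "https://example.com/b"],
    fetch=lambda u: f"<html>{u}</html>",
    extract=lambda p: p[6:-7],            # strip the fake <html> wrapper
    generate=lambda q, texts: f"answered from {len(texts)} sources",
)
```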
Problem: Single URL Fetch = High Failure Rate
Symptoms
Only visiting 1 search result URL meant that if that site was slow or JS-heavy, the entire search failed.
Fix: Parallel 3-URL Fetch
Using concurrent.futures.ThreadPoolExecutor to fetch the top 3 URLs simultaneously:
```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed

fetched = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(_fetch_one, url): url for url in fetch_urls}
    collected = set()
    try:
        for future in as_completed(futures, timeout=30):
            collected.add(future)
            result = future.result(timeout=1)
            if result:
                fetched.append(result)
    except TimeoutError:
        # Keep whatever completed before the deadline, discard the rest.
        # Skip futures already collected above to avoid duplicate results.
        for future in futures:
            if future.done() and future not in collected:
                result = future.result(timeout=0)
                if result:
                    fetched.append(result)
```
Result: Only 1 of 3 needs to succeed. Dramatically reduced failure rate.
Failed Experiment: Host-Side Browser Daemon
Hypothesis
"Chrome on the Windows HOST (full CPU/GPU) should be faster than Chromium in the VM (limited resources)."
Implementation
Created browser_daemon_host.py to run Chrome headless on Windows:
- Served on port 9224
- VM accesses it via 10.0.2.2:9224 (QEMU gateway to the host)
- Chrome CDP with /json/new + Page.navigate
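For reference, a sketch of how the VM side can build the tab-opening endpoint. 10.0.2.2 is QEMU's default user-mode gateway to the host, and /json/new is Chrome's standard DevTools HTTP endpoint (recent Chrome versions require it to be sent as a PUT); the helper name is ours:

```python
from urllib.parse import quote

HOST_GATEWAY = "10.0.2.2"   # QEMU user-mode networking: the guest's route to the host
DAEMON_PORT = 9224

def new_tab_endpoint(target_url: str) -> str:
    """Build the /json/new URL that asks Chrome to open target_url in a new tab.
    The target goes directly in the query string; send as HTTP PUT."""
    return f"http://{HOST_GATEWAY}:{DAEMON_PORT}/json/new?{quote(target_url, safe='')}"
```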
Single URL Fetch Comparison
| | HOST Chrome | VM Chromium |
|---|---|---|
| Single URL fetch | 1.5s | ~2s |
| Difference | 0.5s faster | baseline |
But...
End-to-end API test results:
- With HOST Chrome: 20.0s
- VM Chromium only: 20.4s
0.4s difference. Why? Browser fetch is only ~2s of the total 20s. The remaining 18s was spent elsewhere.
Conclusion
Host daemon added complexity with negligible benefit. Scrapped. A classic case of optimizing the wrong bottleneck.
Finding the Real Bottleneck: 3x Serial LLM Summarization
Code Analysis
```
web_search("book page count")
├─ DDG search ~1s
├─ 3 URL fetch (parallel) ~2s
├─ 3 LLM summaries (qwen3:0.6b) ~12s ← HERE!
│  ├─ URL1 → _summarize_with_llm ~4s
│  ├─ URL2 → _summarize_with_llm ~4s
│  └─ URL3 → _summarize_with_llm ~4s
└─ Final LLM response (gpt-oss:20b) ~5s
```
After fetching each page, tool_fetch_url calls qwen3:0.6b to summarize the raw HTML into structured data (title, author, page count, ISBN, etc.). This was designed to save tokens for the main LLM, but had critical issues:
- 3 summaries run serially — Ollama defaults to processing 1 request at a time
- Model swapping — switching between gpt-oss:20b and qwen3:0.6b adds overhead
- GPU bandwidth sharing — even with "parallel" requests, they share the same GPU
Why Not Just Truncate?
We considered removing LLM summarization and just truncating text to 3000 chars. However, prior testing showed important data (like page count buried deep in the page) could be lost. The 0.6b model intelligently extracts structured info regardless of position in the page.
Solution: Ollama Parallel + Multi-Model Config
```powershell
# Set before starting Ollama in launcher.ps1
$env:OLLAMA_NUM_PARALLEL = "4"       # Handle 4 concurrent requests
$env:OLLAMA_MAX_LOADED_MODELS = "3"  # Keep 3 models in VRAM simultaneously
```
This keeps all models resident in VRAM:
- gpt-oss:20b (17GB) — main conversation
- qwen3:0.6b (8GB) — web page summarization
- bge-m3 (1.2GB) — memory embeddings
Zero model swapping, true parallel processing.
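With the server accepting concurrent requests, the three per-page summaries can be dispatched at once. A minimal sketch, where `summarize` stands in for the real `_summarize_with_llm` call:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_all(pages, summarize, max_workers=3):
    """Dispatch one summarization call per page concurrently.
    With OLLAMA_NUM_PARALLEL >= max_workers the server runs them in
    parallel, so wall time approaches the slowest single call instead
    of the sum of all three."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize, pages))  # preserves input order
```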
Additional daemon improvements:
- ThreadingHTTPServer for concurrent request handling
- Chromium flags to disable images, fonts, CSS, plugins — optimized for text extraction
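A sketch of what such a launch-flag set can look like. These are standard Chromium switches, but the exact set browser_daemon.py uses may differ, and fully disabling CSS and plugins in practice often goes through CDP request interception rather than flags alone:

```python
def text_only_chromium_args(binary="chromium"):
    """Launch arguments for a headless, text-extraction-oriented Chromium."""
    return [
        binary,
        "--headless=new",                        # no visible window
        "--blink-settings=imagesEnabled=false",  # skip image download/decoding
        "--disable-remote-fonts",                # skip web font downloads
        "--disable-extensions",
        "--disable-gpu",
        "--remote-debugging-port=9224",          # expose the CDP endpoint
    ]
```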
Final Results
Test Query: "마음 소화제 뻥뻥수 페이지 수 알려줘" — "Tell me the page count of 마음 소화제 뻥뻥수" (a Korean book title)
| Stage | Before | After |
|---|---|---|
| Web search | ~1s | ~1s |
| URL fetch (1→3 parallel) | ~5s | ~2s |
| LLM summarization (serial→parallel) | ~12s | ~3s |
| Final response | ~5s | ~3s |
| Total | ~20s | ~8s |
Run 1: 8.5s → "The page count of 마음 소화제 뻥뻥수 is 188."
Run 2: 8.2s → "마음 소화제 뻥뻥수 is 188 pages. (source: yes24.com)"
2.5x faster. Accurate answer with source.
Lessons Learned
Measure first, optimize later — We spent hours building a HOST Chrome daemon for a 0.5s improvement on a 2s step, while the real 12s bottleneck was elsewhere.
The bottleneck is never where you expect — "The browser is slow" was our assumption. "Serial LLM calls are slow" was the reality.
One config line beats 300 lines of code — OLLAMA_NUM_PARALLEL=4 solved what browser_daemon_host.py (300 lines) couldn't.
Infrastructure reliability > speed — If the browser daemon is dead, speed doesn't matter. Triple auto-start (deploy, launcher, and VM boot) was the most impactful change.
Cold start matters — First test after Ollama restart showed 23.6s (model loading). Second run: 8.5s. Always warm up before benchmarking.
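The warm-up can be scripted. Per Ollama's API docs, a /api/generate request with an empty prompt loads the model into memory without generating anything; the helper below only builds that request body (model names from this post, default endpoint localhost:11434 assumed):

```python
import json

def warmup_payload(model: str) -> bytes:
    """JSON body for POST /api/generate that preloads `model`:
    an empty prompt makes Ollama load the weights and return immediately."""
    return json.dumps({"model": model, "prompt": "", "stream": False}).encode()

# Before timing, POST warmup_payload(m) to http://localhost:11434/api/generate
# for each chat model, e.g. "gpt-oss:20b" and "qwen3:0.6b".
```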
Files Changed
| File | Change |
|---|---|
| browser_daemon.py | ThreadingHTTPServer, disable images/fonts |
| tools/web_tools.py | Parallel 3-URL fetch, TimeoutError handling, SSH fallback |
| scripts/deploy.ps1 | Browser daemon auto enable+start |
| scripts/launcher.ps1 | Browser health check, Ollama parallel config |
| vm_manager.py | Browser daemon start on VM boot |