Kim Namhyun

Xoul, AI Agent: Optimizing Web Search — From 20s to 8s

How we optimized a local AI agent's web search by finding the real bottleneck.
A story of missteps, failed experiments, and eventually finding the true cause.

​"It's a simple detail in retrospect, but I had an issue with my web search pipeline. Using the raw search results meant passing thousands of tokens to the next input, so I implemented a very small model to summarize the data first. However, because Ollama defaults to sequential execution, the process took 3x longer. It seems obvious now, but it was a major bottleneck before I analyzed it! I also optimized it by disabling non-text rendering (images, CSS, etc.)."

Background

Xoul is a local AI agent that answers user queries by searching the web. The pipeline is: search → visit URLs → extract text → generate LLM response. A simple question like "How many pages is [book title]?" was either failing completely or taking 20+ seconds.

Problem: Single URL Fetch = High Failure Rate

Symptoms

Only visiting 1 search result URL meant that if that site was slow or JS-heavy, the entire search failed.

Fix: Parallel 3-URL Fetch

Using concurrent.futures.ThreadPoolExecutor to fetch the top 3 URLs simultaneously:

from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed

fetched = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(_fetch_one, url): url for url in fetch_urls}
    collected = set()
    try:
        for future in as_completed(futures, timeout=30):
            collected.add(future)
            result = future.result(timeout=1)
            if result:
                fetched.append(result)
    except TimeoutError:
        # Overall timeout hit: keep whatever completed but was not yet
        # yielded by as_completed, discard the rest
        for future in futures:
            if future.done() and future not in collected:
                result = future.result(timeout=0)
                if result:
                    fetched.append(result)

Result: Only 1 of 3 needs to succeed. Dramatically reduced failure rate.

Failed Experiment: Host-Side Browser Daemon

Hypothesis

"Chrome on the Windows HOST (full CPU/GPU) should be faster than Chromium in the VM (limited resources)."

Implementation

Created browser_daemon_host.py to run Chrome headless on Windows:

  • Served on port 9224
  • VM accesses it via 10.0.2.2:9224 (QEMU gateway to the host)
  • Chrome CDP with /json/new + Page.navigate (a rough sketch follows the list)
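
The CDP flow looks roughly like the sketch below. It is a reconstruction, not the actual browser_daemon_host.py: the debugging port, the example URL, and the websocket-client dependency are assumptions; only the /json/new + Page.navigate flow comes from the post.

# Illustrative sketch (not the real browser_daemon_host.py): drive a locally
# running headless Chrome over CDP.
# Assumes Chrome was started with --headless --remote-debugging-port=9222
# and that the websocket-client package is installed.
import json
import requests
import websocket  # pip install websocket-client

CDP = "http://127.0.0.1:9222"

# Chrome 111+ expects PUT for /json/new; older builds accept GET.
tab = requests.put(f"{CDP}/json/new", timeout=5).json()

ws = websocket.create_connection(tab["webSocketDebuggerUrl"])
ws.send(json.dumps({
    "id": 1,
    "method": "Page.navigate",
    "params": {"url": "https://example.com"},
}))
print(ws.recv())  # CDP acknowledgement for the navigation command
ws.close()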

Single URL Fetch Comparison

                    HOST Chrome    VM Chromium
Single URL fetch    1.5s           ~2s
Difference          0.5s faster    baseline

But...

End-to-end API test results:

  • With HOST Chrome: 20.0s
  • VM Chromium only: 20.4s

A 0.4s difference. Why? The browser fetch accounts for only ~2s of the 20s total; the remaining ~18s was spent elsewhere.

Conclusion

Host daemon added complexity with negligible benefit. Scrapped. A classic case of optimizing the wrong bottleneck.

Finding the Real Bottleneck: 3x Serial LLM Summarization

Code Analysis

web_search("book page count")
 ├─ DDG search                        ~1s
 ├─ 3 URL fetch (parallel)            ~2s
 ├─ 3 LLM summaries (qwen3:0.6b)     ~12s ← HERE!
 │   ├─ URL1 → _summarize_with_llm    ~4s
 │   ├─ URL2 → _summarize_with_llm    ~4s
 │   └─ URL3 → _summarize_with_llm    ~4s
 └─ Final LLM response (gpt-oss:20b)  ~5s

After fetching each page, tool_fetch_url calls qwen3:0.6b to summarize the raw HTML into structured data (title, author, page count, ISBN, etc.). This was designed to save tokens for the main LLM, but had critical issues:

  1. 3 summaries run serially — Ollama defaults to processing 1 request at a time
  2. Model swapping — switching between gpt-oss:20b and qwen3:0.6b adds overhead
  3. GPU bandwidth sharing — even with "parallel" requests, they share the same GPU

Why Not Just Truncate?

We considered removing LLM summarization and just truncating text to 3000 chars. However, prior testing showed important data (like page count buried deep in the page) could be lost. The 0.6b model intelligently extracts structured info regardless of position in the page.
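
For reference, the extraction call can look roughly like the sketch below. This is a reconstruction, not the project's _summarize_with_llm: the prompt wording and the default Ollama endpoint are assumptions; only the model name and the extracted fields come from the post.

# Rough sketch of the structured-extraction step (not the actual
# _summarize_with_llm): ask qwen3:0.6b for specific fields so that a page
# count buried deep in the page still survives, unlike naive truncation.
# Assumes Ollama serves its default HTTP API on localhost:11434.
import requests

def _summarize_with_llm(page_text: str) -> str:
    prompt = (
        "From the following web page text, extract the book's title, author, "
        "page count and ISBN. Reply with only those fields.\n\n" + page_text
    )
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "qwen3:0.6b",
        "prompt": prompt,
        "stream": False,
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"]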

Solution: Ollama Parallel + Multi-Model Config

# Set before starting Ollama in launcher.ps1
$env:OLLAMA_NUM_PARALLEL = "4"       # Handle 4 concurrent requests
$env:OLLAMA_MAX_LOADED_MODELS = "3"  # Keep 3 models in VRAM simultaneously

This keeps all models resident in VRAM:

  • gpt-oss:20b (17GB) — main conversation
  • qwen3:0.6b (8GB) — web page summarization
  • bge-m3 (1.2GB) — memory embeddings

Zero model swapping, true parallel processing.
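
With those two variables set, the three page summaries can actually run concurrently. Below is a minimal sketch (again not the real tools/web_tools.py; it reuses the hypothetical _summarize_with_llm from the earlier sketch and placeholder page texts):

# Minimal sketch: with OLLAMA_NUM_PARALLEL >= 3, summarization requests issued
# from separate threads are served concurrently instead of queuing one by one.
from concurrent.futures import ThreadPoolExecutor

pages = ["raw text of page 1", "raw text of page 2", "raw text of page 3"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # _summarize_with_llm is the hypothetical helper sketched earlier
    summaries = list(pool.map(_summarize_with_llm, pages))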

Additional daemon improvements:

  • ThreadingHTTPServer for concurrent request handling (see the sketch after this list)
  • Chromium flags to disable images, fonts, CSS, plugins — optimized for text extraction
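
A minimal sketch of the threaded-server change (the handler body and port are placeholders, not the real browser_daemon.py):

# Sketch only: ThreadingHTTPServer handles each request in its own thread, so
# concurrent fetches from the agent no longer serialize behind one another.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class FetchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The real daemon would drive headless Chromium and return extracted
        # text; this placeholder just echoes the requested path.
        body = f"fetched {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 9223), FetchHandler).serve_forever()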

Final Results

Test Query: "마음 소화제 뻥뻥수 페이지 수 알려줘" ("Tell me the page count of 마음 소화제 뻥뻥수")

Stage                                  Before   After
Web search                             ~1s      ~1s
URL fetch (1→3 parallel)               ~5s      ~2s
LLM summarization (serial→parallel)    ~12s     ~3s
Final response                         ~5s      ~3s
Total                                  ~20s     ~8s
Run 1: 8.5s → "마음 소화제 뻥뻥수의 페이지 수는 188쪽입니다." ("마음 소화제 뻥뻥수 is 188 pages.")
Run 2: 8.2s → "마음 소화제 뻥뻥수는 188쪽입니다. (source: yes24.com)" (same answer, with source)

2.5x faster. Accurate answer with source.

Lessons Learned

  1. Measure first, optimize later — We spent hours building a HOST Chrome daemon for a 0.5s improvement on a 2s step, while the real 12s bottleneck was elsewhere.

  2. The bottleneck is never where you expect — "The browser is slow" was our assumption. "Serial LLM calls are slow" was the reality.

  3. One config line beats 300 lines of code — OLLAMA_NUM_PARALLEL=4 solved what browser_daemon_host.py (300 lines) couldn't.

  4. Infrastructure reliability > speed — If the browser daemon is dead, speed doesn't matter. Triple auto-start (deploy script, launcher, and VM boot) was the most impactful change.

  5. Cold start matters — First test after Ollama restart showed 23.6s (model loading). Second run: 8.5s. Always warm up before benchmarking.

Files Changed

File                  Change
browser_daemon.py     ThreadingHTTPServer, disable images/fonts
tools/web_tools.py    Parallel 3-URL fetch, TimeoutError handling, SSH fallback
scripts/deploy.ps1    Browser daemon auto enable+start
scripts/launcher.ps1  Browser health check, Ollama parallel config
vm_manager.py         Browser daemon start on VM boot
