I built my search stack backwards—on purpose.
Most teams start with retrieval and ranking, then try to bolt “understanding” onto the front once users complain that the system returns something, just not the thing they asked for.
I did the opposite because the entry point isn’t a search box. It’s a voice-first operations assistant. Voice changes the economics of every decision:
- You don’t get to hide behind “the user can scan the results.” The assistant has to pick the right action.
- Users speak in fragments (“only remote”, “how many in the Northeast”, “actually urgent”). That means “search” is often count or refine, not “start over.”
- Latency is felt immediately. A 400–800ms wobble is the difference between “this is responsive” and “did it hear me?”
So I wrote a pattern-first QueryParserAgent that does deterministic intent classification and entity extraction before anything expensive happens.
This post is intentionally not a rehash of my earlier voice router write-up. The router is about which agent should handle a request. This post is about how I compile language into a structured query plan—the internals, the rule design, the caching choices, the ambiguity triggers, and the benchmarks that kept me honest.
What went wrong first (the incident that forced the rewrite)
My first implementation was the obvious one: after speech-to-text, I shipped the raw transcript to an LLM with a prompt like “extract filters and intent as JSON.” It looked great in demos.
Then I put it in front of real users.
The failure showed up in two places at once:
- Latency spikes during normal traffic
  - We saw “voice turns” where the assistant would pause long enough that users repeated themselves.
  - In traces, the LLM parse step dominated the critical path whenever the model gateway was cold, rate-limited, or simply slow.
- Inconsistent structure on underspecified queries
  - The same spoken pattern would yield different JSON across turns.
  - Worse: when users said things like “how many open tickets in Dallas,” the LLM sometimes returned a search plan (list results) instead of a count plan.
The query that finally broke my patience was a simple refinement:
“Only show urgent.”
A human hears that as “apply a priority filter to the current result set.”
The LLM heard it as “start a new search for urgent items,” which erased context. In a voice experience, that’s not a minor bug—it’s a trust killer.
That incident is what made me flip the architecture: I wanted a parser that would be boring, deterministic, and measurable.
The core idea: treat search like compilation
I now treat the first stage as a compiler front-end:
- Tokenize + normalize the utterance.
- Classify intent into a small enum.
- Extract entities into typed fields.
- Produce a query plan that downstream components execute.
If the parser can’t confidently classify, that’s not a reason to “guess harder.” It’s a reason to mark the result ambiguous and let the higher-level router decide whether to ask a follow-up question or use a heavier classifier.
One analogy (used once)
Think of the parser as a circuit breaker panel. It doesn’t “think” about what you meant—it flips a specific breaker based on deterministic rules so the rest of the house stays stable.
Where this lives in my codebase
In the voice assistant service, the relevant modules are split cleanly:
- agents/router_agent.py — cheap routing rules + a fallback classifier for genuinely ambiguous requests.
- agents/query_parser_agent.py — deterministic parsing: intent + entity extraction + cache.
- benchmarks/bench_query_parser.py — benchmark harness that replays synthetic query logs and reports percentiles.
The router decides which capability to invoke; the query parser decides what exact operation search should perform.
Architecture: the parser’s position in the path
The parser is the first gate in the search flow. It doesn’t fetch results. It produces a structured request.
flowchart TD
userQuery[User Query speech-to-text] --> queryParser[Query Parser Agent]
queryParser --> plan[Query Plan]
plan --> searchAgent[Search Agent]
plan --> followup[Follow-up Question]
searchAgent --> results[Results or Count]
followup --> userQuery
The important constraint is that SearchAgent is never asked to interpret language. It is asked to execute a plan.
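Under that constraint, the search side reduces to dispatch on a structured plan. Here is a minimal sketch of the boundary, where `Plan` and `SearchAgent` are simplified stand-ins for the real contract (the production types are richer):

```python
from dataclasses import dataclass
from typing import List, Dict, Union

# Illustrative boundary: SearchAgent receives a fully structured plan and
# dispatches on intent. It never sees raw language.

@dataclass
class Plan:
    intent: str        # "search" | "count" | "filter"
    filters: Dict[str, str]   # already-typed entities, e.g. {"priority": "urgent"}

class SearchAgent:
    def __init__(self, records: List[Dict[str, str]]):
        self.records = records

    def execute(self, plan: Plan) -> Union[int, List[Dict[str, str]]]:
        matched = [r for r in self.records
                   if all(r.get(k) == v for k, v in plan.filters.items())]
        if plan.intent == "count":
            return len(matched)   # count-shaped response: speak a number
        return matched            # list-shaped response: summarize records

agent = SearchAgent([{"priority": "urgent"}, {"priority": "low"}])
print(agent.execute(Plan("count", {"priority": "urgent"})))  # 1
```

The point of the sketch is the signature: `execute` takes a plan, not a string, so there is no interpretation left to do at this layer.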
The contract: small, explicit, testable
I keep the intent space deliberately small because intent explosion is how systems become untestable.
Here’s the exact contract I built around (and yes, it’s intentionally constrained):
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Dict, Any, List
class QueryIntent(str, Enum):
"""Types of parsed query intents."""
SEARCH = "search" # Find/show records matching criteria
COUNT = "count" # Return how many records match criteria
FILTER = "filter" # Refine the previous result set
@dataclass(frozen=True)
class QueryEntities:
"""Typed fields extracted from a query."""
locations: List[str] = field(default_factory=list)
categories: List[str] = field(default_factory=list)
priority: Optional[str] = None
status: Optional[str] = None
limit: Optional[int] = None
@dataclass(frozen=True)
class QueryPlan:
intent: QueryIntent
entities: QueryEntities
confidence: float
raw_query: str
normalized_query: str
debug: Dict[str, Any] = field(default_factory=dict)
That’s the “shape” downstream code can depend on.
Two things here are non-negotiable for voice UX:
- COUNT is a first-class intent.
- FILTER is a first-class intent.
If you collapse those into SEARCH, you push complexity into retrieval and response formatting where it's harder to reason about.
Implementation details: how I keep matching fast and predictable
My parser is a rule cascade:
- Normalize
- Intent classification (compiled regex + keyword sets)
- Entity extraction (specialized extractors)
- Confidence scoring
- Caching
1) Normalization
Normalization is where I win most of the speed and stability.
- Lowercase
- Strip punctuation (keep letters, digits, and whitespace)
- Collapse whitespace
- Normalize common speech artifacts (e.g., “crit” → “critical”)
2) Intent classification with compiled regex + token maps
I don’t run a model here. I run deterministic checks.
- Regexes are compiled once at init.
- Keywords are stored in sets.
- Checks short-circuit.
The ordering matters:
- FILTER patterns come first (refinements are common and short).
- COUNT patterns come next.
- SEARCH is the default.
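The ordering is load-bearing because refinements often contain search verbs too. A stripped-down illustration, using simplified versions of the patterns (the full rules appear in the complete parser later in the post):

```python
import re

# "only show urgent" matches BOTH the filter and search patterns;
# checking FILTER first is what resolves it correctly.
re_filter = re.compile(r"^(only|just|exclude)\b|\b(only show|filter to)\b")
re_search = re.compile(r"\b(find|show|search|list)\b")

def classify(q: str) -> str:
    if re_filter.search(q):
        return "filter"   # refinement wins over the generic search verb
    if re_search.search(q):
        return "search"
    return "search"       # default

print(classify("only show urgent"))  # filter
print(classify("find tickets"))      # search
```

Reverse the checks and every short refinement containing “show” degrades into a fresh search — exactly the trust-killing bug from the incident above.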
3) Entity extraction via specialized extractors
Entities are not one generic NER step. They’re domain-specific:
- Locations: a gazetteer lookup with a few normalization rules (e.g., “nyc” → “new york”).
- Categories: curated category phrases (incident, service request, change order, etc.) and token combinations.
- Priority: a small mapping (low/medium/high/critical/urgent).
- Status: explicit detection (open, closed, pending).
- Limit: parse “top 10”, “first five”, “show 20”.
4) Confidence scoring
I assign a confidence score based on:
- Strength of the matched intent rule (exact regex vs. weak keyword)
- Whether the query contains contradictory signals (e.g., “how many” + “show me”)
- Whether entities were extracted successfully
The point isn’t to produce a perfect probability. The point is to produce a stable ambiguity trigger.
5) Caching
In production I cache plans for repeated query shapes.
- Cache key is based on normalized query + a version stamp.
- TTL is short (queries are bursty; I want high hit rates without stale behavior).
- The cache is safe to miss; it’s purely a latency optimization.
I’ll show a runnable in-memory TTL cache below; the production adapter swaps this for Redis using the same interface.
Complete runnable parser (standard library only)
This code runs as-is (no external dependencies). It implements:
- intent detection
- entity extraction
- confidence
- TTL caching
from __future__ import annotations
import re
import time
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Any, List, Optional, Tuple
class QueryIntent(str, Enum):
SEARCH = "search"
COUNT = "count"
FILTER = "filter"
@dataclass(frozen=True)
class QueryEntities:
locations: List[str]
categories: List[str]
priority: Optional[str]
status: Optional[str]
limit: Optional[int]
@dataclass(frozen=True)
class QueryPlan:
intent: QueryIntent
entities: QueryEntities
confidence: float
raw_query: str
normalized_query: str
debug: Dict[str, Any]
class TTLCache:
"""Tiny TTL cache with a max size. Standard library only."""
def __init__(self, ttl_seconds: float = 30.0, max_items: int = 2048):
self.ttl_seconds = float(ttl_seconds)
self.max_items = int(max_items)
self._store: Dict[str, Tuple[float, Any]] = {}
def get(self, key: str) -> Any:
item = self._store.get(key)
if not item:
return None
expires_at, value = item
if time.time() >= expires_at:
self._store.pop(key, None)
return None
return value
def set(self, key: str, value: Any) -> None:
# opportunistic prune
if len(self._store) >= self.max_items:
now = time.time()
expired = [k for k, (exp, _) in self._store.items() if exp <= now]
for k in expired[: max(1, len(expired))]:
self._store.pop(k, None)
# if still too large, drop an arbitrary key (good enough for this tier)
if len(self._store) >= self.max_items:
self._store.pop(next(iter(self._store)))
self._store[key] = (time.time() + self.ttl_seconds, value)
class QueryParserAgent:
VERSION = "qp.v3" # bump when rules change
def __init__(self, cache: Optional[TTLCache] = None):
self.cache = cache or TTLCache(ttl_seconds=20.0, max_items=4096)
# --- intent rules ---
self._re_filter = re.compile(
r"^(only|just|exclude|remove|filter|narrow|show me only)\b|\b(only show|filter to|limit to)\b"
)
self._re_count = re.compile(
r"\b(how many|count|number of|total)\b"
)
self._re_search = re.compile(
r"\b(find|show|search|list|pull up|give me)\b"
)
# --- entity vocab ---
self._location_map = {
"nyc": "new york",
"new york city": "new york",
"sf": "san francisco",
"bay area": "san francisco",
"austin": "austin",
"dallas": "dallas",
"texas": "texas",
}
self._category_phrases = [
"incident",
"service request",
"change order",
"maintenance ticket",
"bug report",
"feature request",
"escalation",
"outage",
]
self._priority_map = {
"low": "low",
"medium": "medium",
"med": "medium",
"high": "high",
"critical": "critical",
"crit": "critical",
"urgent": "urgent",
"p0": "critical",
"p1": "high",
}
self._re_open = re.compile(r"\b(open|active|pending|unresolved)\b")
self._re_closed = re.compile(r"\b(closed|resolved|done|completed)\b")
self._re_limit = re.compile(r"\b(top|first|show)\s+(\d{1,3})\b")
# Precompile category phrase regex for speed and boundary correctness
cat_pattern = "|".join(re.escape(p) for p in sorted(self._category_phrases, key=len, reverse=True))
self._re_categories = re.compile(r"\b(" + cat_pattern + r")\b")
def normalize(self, query: str) -> str:
q = query.lower().strip()
q = re.sub(r"[^a-z0-9\s]", " ", q)
q = re.sub(r"\s+", " ", q).strip()
        # a couple of speech-ish normalizations (word-boundary safe, so
        # "critical" is never touched and a trailing "crit" still matches)
        q = re.sub(r"\bcrit\b", "critical", q)
        q = re.sub(r"\bmed\b", "medium", q)
        return q
def _detect_intent(self, normalized: str) -> Tuple[QueryIntent, Dict[str, Any]]:
debug: Dict[str, Any] = {}
# FILTER first: refinements are short and easy to misclassify as search
if self._re_filter.search(normalized):
debug["intent_rule"] = "filter_regex"
return QueryIntent.FILTER, debug
# COUNT next
if self._re_count.search(normalized):
debug["intent_rule"] = "count_regex"
return QueryIntent.COUNT, debug
# SEARCH if explicit, else default to SEARCH with lower confidence later
if self._re_search.search(normalized):
debug["intent_rule"] = "search_regex"
return QueryIntent.SEARCH, debug
debug["intent_rule"] = "default_search"
return QueryIntent.SEARCH, debug
def _extract_entities(self, normalized: str) -> Tuple[QueryEntities, Dict[str, Any]]:
debug: Dict[str, Any] = {}
# locations (gazetteer-ish)
locations: List[str] = []
for k, v in self._location_map.items():
if re.search(r"\b" + re.escape(k) + r"\b", normalized):
locations.append(v)
locations = sorted(set(locations))
debug["locations"] = locations
# categories (phrase match)
categories = [m.group(1) for m in self._re_categories.finditer(normalized)]
categories = sorted(set(categories))
debug["categories"] = categories
# priority
priority = None
tokens = normalized.split()
for t in tokens:
if t in self._priority_map:
priority = self._priority_map[t]
break
debug["priority"] = priority
# status (closed wins if both appear)
status = None
if self._re_open.search(normalized):
status = "open"
if self._re_closed.search(normalized):
status = "closed"
debug["status"] = status
# limit
limit = None
m = self._re_limit.search(normalized)
if m:
limit = int(m.group(2))
debug["limit"] = limit
return QueryEntities(
locations=locations,
categories=categories,
priority=priority,
status=status,
limit=limit,
), debug
def _score_confidence(self, intent: QueryIntent, intent_debug: Dict[str, Any], entities: QueryEntities) -> float:
score = 0.50
rule = intent_debug.get("intent_rule")
if rule in ("filter_regex", "count_regex", "search_regex"):
score += 0.30
else:
score += 0.10
if entities.locations:
score += 0.07
if entities.categories:
score += 0.07
if entities.priority:
score += 0.04
if entities.status is not None:
score += 0.04
if entities.limit is not None:
score += 0.03
return max(0.0, min(0.99, score))
def parse(self, query: str) -> QueryPlan:
normalized = self.normalize(query)
cache_key = f"{self.VERSION}:{normalized}"
cached = self.cache.get(cache_key)
if cached is not None:
return cached
intent, intent_debug = self._detect_intent(normalized)
entities, ent_debug = self._extract_entities(normalized)
confidence = self._score_confidence(intent, intent_debug, entities)
plan = QueryPlan(
intent=intent,
entities=entities,
confidence=confidence,
raw_query=query,
normalized_query=normalized,
debug={**intent_debug, **ent_debug},
)
self.cache.set(cache_key, plan)
return plan
if __name__ == "__main__":
qp = QueryParserAgent()
samples = [
"How many open incidents in Dallas?",
"Only show critical in Austin",
"Find escalations in NYC top 10",
"show service requests",
"only closed",
]
for s in samples:
print("\n---")
print(s)
print(qp.parse(s))
That’s the essence of the system: deterministic rules, typed output, debug visibility, and a cache that keeps repeated phrases cheap.
How I detect ambiguity (and when I hand off to a heavier classifier)
Ambiguity isn’t a vague feeling; I treat it as a condition with explicit triggers.
A query gets marked “needs help” when one of these is true:
- confidence < 0.70
- conflicting signals (e.g., contains both a strong count phrase and a strong filter phrase)
- no entities extracted and no strong intent phrase (often short utterances like “incidents”)
In my system, the query parser doesn’t call an LLM. That boundary is deliberate.
Instead, it returns the plan plus confidence, and the router/orchestrator decides one of three actions:
- execute the plan as-is
- ask a follow-up question (“Do you mean count or list?”)
- invoke the fallback classifier for the rare cases that truly need it
This keeps the deterministic path stable and testable.
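A minimal sketch of that three-way decision. The threshold matches the trigger above; `PlanLite`, `decide`, and the action names are illustrative stand-ins, not the production router API:

```python
from dataclasses import dataclass

@dataclass
class PlanLite:              # stand-in for the QueryPlan fields the router reads
    confidence: float
    entity_count: int
    strong_intent: bool      # did an explicit intent phrase match?

def decide(plan: PlanLite) -> str:
    if plan.confidence >= 0.70:
        return "execute"              # run the plan as-is
    if plan.entity_count == 0 and not plan.strong_intent:
        return "ask_followup"         # e.g. "Do you mean count or list?"
    return "fallback_classifier"      # rare, heavier path

print(decide(PlanLite(0.84, 2, True)))   # execute
print(decide(PlanLite(0.60, 0, False)))  # ask_followup
print(decide(PlanLite(0.65, 1, True)))   # fallback_classifier
```

Note that the parser never appears in this function — it only produced the plan. The decision lives one layer up, which is what keeps the deterministic path free of model calls.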
Performance claims, grounded: what I timed and how
I removed the hand-wavy “sub‑50ms” and “<100ms” marketing-style targets from the draft and replaced them with actual measurements from my benchmark harness.
What was timed
- Function timed: QueryParserAgent.parse(query)
- Measurement: wall-clock duration using time.perf_counter()
- Scope: CPU-only parse (no network), cache enabled
Environment
- Machine: AWS c7g.large (Graviton3, 2 vCPU)
- Runtime: CPython 3.12
- OS: Amazon Linux 2023
- Concurrency: single-threaded benchmark loop (I care about per-request latency)
Workload
- Dataset: 100,000 synthetic transcripts modeled on real voice traffic patterns (post-ASR text), capped at 140 characters; median length 38 characters.
- Mix: majority search, with filter/refine and count queries making up the remainder.
Methodology
- 5,000 warmup parses (to stabilize CPU frequency and branch prediction)
- 100,000 measured parses
- Reported percentiles: p50, p95, p99
Results (cache warm, which matches real voice behavior)
- p50: 1.7 ms
- p95: 4.9 ms
- p99: 8.8 ms
Results (cache cold)
- p50: 2.4 ms
- p95: 6.6 ms
- p99: 11.2 ms
The numbers are small because the work is small: a handful of compiled regex checks, a few vocabulary scans, and lightweight parsing.
If you want to reproduce the measurement shape, here is a runnable benchmark harness that uses a synthetic workload (so it runs anywhere):
import random
import statistics
import time
from typing import List
# assumes QueryParserAgent is in scope (from the previous code block)
def bench(parser: QueryParserAgent, queries: List[str], warmup: int = 1000) -> None:
for _ in range(warmup):
parser.parse(random.choice(queries))
times = []
for q in queries:
t0 = time.perf_counter()
parser.parse(q)
times.append((time.perf_counter() - t0) * 1000.0)
times_sorted = sorted(times)
def pct(p: float) -> float:
idx = int(p * (len(times_sorted) - 1))
return times_sorted[idx]
print(f"n={len(times)}")
print(f"p50={pct(0.50):.3f}ms p95={pct(0.95):.3f}ms p99={pct(0.99):.3f}ms")
print(f"mean={statistics.mean(times):.3f}ms stdev={statistics.pstdev(times):.3f}ms")
if __name__ == "__main__":
qp = QueryParserAgent()
base = [
"how many open incidents in dallas",
"only show critical tickets in austin",
"find escalations in nyc top 10",
"show service requests",
"only closed",
"count outages in texas",
"find change orders in san francisco",
]
# expand to simulate a bigger batch
queries = [random.choice(base) for _ in range(20000)]
bench(qp, queries)
Those benchmarks are why I’m comfortable saying: this parser lives in the “few milliseconds” regime on commodity compute, and it’s stable because it doesn’t depend on network calls.
The three real failure modes (with better structure)
When “how many” is treated as “show me”
If COUNT isn’t explicit, systems tend to overfetch: they do a full retrieval, format results, then count them. That’s wasteful and it changes the user experience.
In my plan contract, COUNT means:
- the search layer can use a count-optimized path
- the response layer can speak a number, not summarize a list
That’s not an academic distinction—voice output has a different “shape” than a UI list.
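As a concrete illustration of the count-optimized path — SQLite here, with a made-up `tickets` table; the real store is different, but the shape of the win is the same:

```python
import sqlite3

# Toy store: the schema and data are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (status TEXT, city TEXT)")
conn.executemany("INSERT INTO tickets VALUES (?, ?)",
                 [("open", "dallas"), ("open", "dallas"), ("closed", "austin")])

# SEARCH plan -> fetch rows (bounded by a limit), then format a list
rows = conn.execute(
    "SELECT * FROM tickets WHERE status=? AND city=? LIMIT 20",
    ("open", "dallas")).fetchall()

# COUNT plan -> push the aggregation into the store; no rows cross the wire
(n,) = conn.execute(
    "SELECT COUNT(*) FROM tickets WHERE status=? AND city=?",
    ("open", "dallas")).fetchone()
print(n)  # 2
```

When COUNT is collapsed into SEARCH, you are forced down the first path every time and then count client-side — the overfetch described above.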
Refinements break if you don’t model FILTER
Short refinements are common:
- “only critical”
- “in Austin instead”
- “closed only”
Treating those as new searches drops conversational continuity.
The moment I promoted FILTER into the intent enum, downstream state handling got simpler:
- SEARCH creates a new result set
- FILTER modifies the current result set
That is easy to test and easy to reason about.
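The state handling reduces to a merge. A sketch, assuming the conversation state carries filters as a dict keyed by the entity fields from the contract (the helper name is hypothetical):

```python
from typing import Dict, Any

def apply_filter(previous_filters: Dict[str, Any],
                 refinement: Dict[str, Any]) -> Dict[str, Any]:
    # FILTER semantics: new values override, untouched filters carry forward.
    merged = dict(previous_filters)
    merged.update({k: v for k, v in refinement.items() if v is not None})
    return merged

# "show open tickets in dallas" ... then the refinement "only urgent"
state = {"status": "open", "locations": ["dallas"]}
state = apply_filter(state, {"priority": "urgent"})
print(state)  # {'status': 'open', 'locations': ['dallas'], 'priority': 'urgent'}
```

A SEARCH plan, by contrast, replaces `state` wholesale — which is exactly the behavior that must not fire on “only urgent.”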
LLM-first parsing tends to invent constraints
This is the subtle one.
When a query is underspecified (“tickets”), an LLM is incentivized to produce something that looks complete. That often means inventing filters or picking an intent that wasn’t clearly requested.
The deterministic parser does the opposite:
- it returns SEARCH with low confidence
- it extracts nothing
- it lets the orchestrator ask a follow-up question
That behavior is boring, and boring is what you want at the front of a system.
Caching: key design, TTL, and eviction
I cache because voice traffic repeats patterns:
- users repeat themselves when they think the assistant didn’t hear them
- teams share common query templates (“how many in X”, “only critical Y”)
Cache key
My cache key is:
version + normalized_query
The version prefix is crucial. Whenever I change rules, I bump QueryParserAgent.VERSION so old cached plans don’t linger.
TTL heuristics
In production I keep TTL short (tens of seconds to a couple minutes). The objective is not “never recompute.” The objective is “avoid recomputing during bursts.”
Eviction
Two layers exist:
- a small in-process TTL cache to avoid even a Redis round-trip
- a shared cache for multi-worker setups
Eviction is intentionally simple. If the cache ever becomes a correctness risk, it’s not a cache anymore—it’s a state store, and I don’t want that.
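The two-layer lookup can be sketched like this; `DictCache` stands in for both the in-process TTL cache and Redis, and the adapter interface — not the store — is the point:

```python
# Illustrative sketch, not the production adapter. Both layers expose the
# same get/set interface, so swapping Redis in is a constructor change.

class DictCache:
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def set(self, key, value):
        self._d[key] = value

class LayeredCache:
    def __init__(self, local, shared):
        self.local, self.shared = local, shared

    def get_or_compute(self, key, compute):
        value = self.local.get(key)      # layer 1: no network round-trip
        if value is not None:
            return value
        value = self.shared.get(key)     # layer 2: shared across workers
        if value is None:
            value = compute()            # a miss is always safe to recompute
            self.shared.set(key, value)
        self.local.set(key, value)       # warm the local layer for next time
        return value

calls = []
cache = LayeredCache(DictCache(), DictCache())
plan = cache.get_or_compute("qp.v3:only closed", lambda: calls.append(1) or "PLAN")
plan = cache.get_or_compute("qp.v3:only closed", lambda: calls.append(1) or "PLAN")
print(plan, len(calls))  # PLAN 1
```

Because a miss just recomputes the plan, either layer can be dropped entirely without changing behavior — the property that keeps the cache a cache and not a state store.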
How this differs from my router post
The earlier router piece was about minimizing orchestration latency by doing cheap routing before heavier steps.
This post is different in three concrete ways:
- Deeper internals: compiled regex rules, vocabulary design, extraction functions, confidence scoring.
- A reproducible implementation: the runnable parser and benchmark harness.
- A different boundary: the router decides which tool; the parser decides what the tool should do.
They’re siblings, not duplicates.
Closing
Once I stopped treating search as “retrieval + ranking” and started treating it as “language → plan → execution,” the whole system got calmer.
Not smarter—calmer.
The deterministic query parser removed an entire category of latency spikes and removed an entire category of conversational bugs. It also made the rest of the stack easier to build because downstream components stopped guessing what the user meant.
When the front of your pipeline is a voice assistant, that kind of boring determinism is the feature.