How I Recovered Weak Matches with Controlled Expansion and Bundled Evidence in a Solo PoC
TL;DR
This write-up documents a personal experiment I ran while thinking about how guests actually ask questions at the front desk, on the phone, or in a chat widget. The phrasing is rarely canonical. Someone might say they want a rubdown later today when the policy text says massage appointments and cancellation windows. Another guest might describe late checkout as staying past the morning rush. If you treat every utterance as a perfect keyword match, you will look clever in a slide deck and brittle in real language. I built a small Python system that embeds synthetic hotel policy chunks with a compact sentence transformer, measures cosine similarity against guest questions, and applies a narrow healing loop when the score falls below a floor I set by hand. The loop tries controlled synonym expansion first, then merges the top passages into a bundled evidence string when the model still hesitates. I also track synthetic staleness days on each chunk so the PoC can pretend that some documents deserve an offline review queue. Nothing here runs in production, nothing here connects to a property I have worked with, and nothing here should be read as advice from a vendor. I am describing an exploratory solo build because that is what it is. The code lives in a public repository so anyone can inspect the assumptions without asking me to narrate them from memory. If you take one idea away, take this one: I cared more about making failure visible and recoverable than about chasing the highest possible retrieval score on a toy corpus. Transparency beat vanity in my priorities for this project.
Introduction
Hospitality guest operations sit at an interesting intersection of empathy and procedure. Guests want speed and clarity. Operators want consistency and traceability. Tools in the middle often promise both and deliver neither if they hide how an answer was assembled. I have been thinking about that tension while experimenting with retrieval stacks that admit their own uncertainty. This article is my attempt to write down what I tried, what I measured informally, and where I stopped on purpose.
Before I describe the modules, I want to anchor the posture of this article. I am writing as an individual who builds experiments in public to learn, not as someone who is reporting on a deployed guest assistant or claiming validated operational outcomes. Hospitality is easy to romanticize and easy to misrepresent. I have watched people ask oddly phrased questions in lobbies and elevators, not as a researcher with formal instruments but as someone who notices wording when I travel. The questions are rarely perfect. A person might compress three concerns into one sentence about noise, timing, and fairness. If you flatten that into a single embedding and hope for the best, you can still retrieve plausible text, but you lose the story of why the match was weak in the first place. That loss matters to me as an engineer because I like systems that admit uncertainty instead of laundering it behind fluent text.
In my PoC I chose a hotel guest-operations framing because it is relatable, easy to illustrate with synthetic documents, and distinct from other domains I have written about recently in this personal series. I am not describing revenue management, banquet sales, or accounting. I am staying with a narrow slice: how a policy-grounded assistant might assemble evidence before any human writes a guest-facing sentence. I also want to be clear about the human layer. Staff use judgment. A system that pretends to replace that judgment with a single score is not something I would defend. What I am experimenting with is a structured scaffold that keeps evidence visible so a human can still override.
I wrote this article because I wanted a serious project that still fits on a laptop. I have seen enough demos where a model produces fluent language and hides the underlying evidence. I wanted the opposite. I wanted logs that read like engineering notes, not marketing copy. The code prints healing actions such as none, synonym_expand, or context_merge, and it writes a small matplotlib chart so I can see whether the batch run skewed toward one outcome because of a bug or because of the wording of my synthetic questions.
There is another motivation I should state plainly. I am interested in practices that survive contact with messy language. People shorten words, omit nouns, and rely on context. They say the morning rush thing instead of spelling out late checkout. Any retrieval system that assumes the query already contains canonical terms will fail in ways that look embarrassing on a demo but painful in real life. I did not solve that fully here. I only created a place to talk about it honestly while still writing code.
I also want readers to know the scope boundary I used while writing. This article discusses a synthetic dataset and illustrative thresholds. It does not describe any real hotel brand, franchise agreement, or property staffing model. If a phrase resembles language you have seen in the wild, that is because operational writing converges on similar vocabulary, not because I copied private material.
A note on language and tone
I chose neutral, procedural wording for the synthetic policies on purpose. I did not want sensational examples that read like a thriller. Real front-desk life already carries enough stress without my demo adding theatrical conflict. I also avoided idioms that only make sense in one region. The point was to keep the text boring enough that retrieval mistakes are visible instead of being masked by narrative drama.
How this article relates to my other experiments
I have written about routing and retrieval in other contexts. This piece is different because the healing loop is the protagonist. I am not showcasing a multi-agent cast. I am showcasing a measurement-and-repair cycle that could exist inside many larger systems. If you have read my earlier write-ups, you might recognize my preference for logs over slogans. That preference shows up again here in how I print healing actions and scores.
What's This Article About?
- The article walks through GuestResilience-HotelContext-AI, a Python project that embeds short policy chunks, scores guest questions with cosine similarity, and applies a healing loop when confidence falls below a configurable floor.
- I explain why I combined semantic retrieval with a hand-built synonym map rather than relying on either signal alone in isolation.
- I show how the batch table and matplotlib chart help me see whether the demo skews toward synonym expansion because of the synonym list or because of the embedding geometry on a tiny corpus.
- I discuss limitations honestly: miniature corpora, heuristic floors, and a merge score that is not a calibrated probability.
- I include a code walkthrough that mirrors how I read the repository myself when I return to it after a gap.
Tech Stack
Runtime expectations on a laptop
I tested this PoC on a recent Mac laptop with a normal consumer CPU. Inference time for a batch of half a dozen questions is small enough that I did not bother printing millisecond timings in the CLI. If you run on older hardware, the first embedding pass over the chunks might take longer, but it remains a one-time cost per process start. I mention hardware because retrieval demos often silently assume a GPU. I did not require a GPU for this code path.
The implementation is straightforward on purpose. I rely on Python 3.10 or newer, NumPy, scikit-learn largely as a transitive dependency, sentence-transformers with the all-MiniLM-L6-v2 model for normalized embeddings, matplotlib for a bar chart, and Rich for readable terminal tables. There is no hosted vector database and no cloud requirement for the retrieval math itself. The entire index fits in memory because I refused to pretend this PoC is big data.
From where I stand, that stack is enough to demonstrate the idea that resilience in the small can be practiced with transparent steps when the corpus is tiny and the goal is structured evidence rather than open-ended generation. If I later swap MiniLM for another encoder, the interfaces around chunking and healing remain stable, which was a design goal while I sketched the modules.
Why Read It?
- If you are evaluating how to structure pre-model logic for operational assistants, this article offers a concrete pattern: measure confidence, attempt a deterministic repair, then widen evidence before you give up.
- If you are learning sentence-transformers with cosine similarity, the retrieval module is short and testable.
- If you care about reproducibility, the orchestrator gives you a baseline against which any future learned rewriter can be compared.
- I think the read is most useful for practitioners who want a middle ground between pure neural retrieval and pure rules, because the code shows exactly where those worlds meet in my PoC.
There is also a pedagogical angle I care about. Many tutorials jump straight to large language models for every turn without establishing a measurement story. I am not anti-LLM; I use them elsewhere. But I believe beginners should see cosine similarity on explicit vectors at least once, because it demystifies what nearest neighbor means in code rather than in marketing language.
Finally, if you maintain open-source examples, you know the burden of dependencies. I kept the stack bounded so a reader in a constrained environment can still run the demo after accepting the one-time model download.
Let's Design
Framing the problem without overfitting the story
Before touching code, I spent time writing short synthetic guest questions on paper. I noticed recurring patterns: some messages emphasize time pressure early, others bury the actionable detail in the second half, and a few mix wellness language with policy language. I did not try to split multi-intent questions into multiple tickets in this repository. Instead, I focused on a single-text input so the healing loop stays easy to reason about. That choice trades realism for clarity, and I am comfortable stating that upfront.
Why semantic similarity plus a synonym map
The design starts from an observation I kept returning to while prototyping: informal words are not interchangeable with policy words, yet they often co-occur in real speech. A guest might say rubdown while the document says massage. The embedding model sometimes closes that gap on its own, and sometimes it does not. I did not want a black box rewrite of the query. I wanted a controlled expansion list that I can audit, prune, and argue about in a code review with myself.
The retrieval layer builds a normalized embedding matrix for every chunk. For each query, I compute cosine similarity as a dot product because the vectors are unit length. I take the top five for debugging, but the orchestration decision hinges on the best score versus a floor constant defined in RetrievalConfig.
The healing step as a confidence gate
I use the word healing in a narrow sense. There is no autonomous agent calling external tools without bounds. I am referring to a decision step that tries synonym expansion when the first pass looks weak, then merges the top passages when the expanded query still fails to clear the floor. That is not deep reasoning. It is a guardrail with two rungs. I still call it healing in the sense that the system attempts to repair a weak match before it declares the evidence unreliable.
If I were to extend this experiment, I would log the floor crossings and measure how often expansion helps relative to merge. In this PoC, I only observe the behavior in the console and in the chart.
Observability as a first-class requirement
I insisted on ASCII-friendly batch output because I wanted copy-pasteable logs for my own notes. Rich tables are not strictly necessary, but they make the first screen readable when I am tired. The matplotlib chart is a concession to the human visual system. Even a simple bar chart changes how I perceive imbalance across healing actions.
Ethics and guest-facing tone
I thought carefully about how synthetic hospitality language can still carry real emotional weight for readers. I avoided sensational scenarios. I kept policies dull on purpose because dull policy text is what operational systems ingest. I also avoided implying that this PoC could triage safety-critical incidents. It cannot. It is a toy corpus with toy thresholds.
Let's Get Cooking
The public repository is here: https://github.com/aniket-work/GuestResilience-HotelContext-AI
I will highlight three slices of the code that capture the spirit of the build: configuration boundaries, the healing orchestration, and the batch entry point.
Configuration as a contract with future me
I centralized thresholds and the synonym map in one module so I cannot pretend magic numbers appeared from nowhere. The floor is a single float. The synonym map is a dictionary from informal triggers to extra tokens that nudge the embedding toward policy vocabulary.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Tunable thresholds for the PoC; not tuned on production traffic."""

    similarity_floor: float = 0.45
    staleness_warning_days: int = 45
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"


SYNONYM_EXPANSIONS: dict[str, tuple[str, ...]] = {
    "pool": ("swimming", "aquatics", "lap pool", "towels"),
    "spa": ("massage", "wellness", "studio", "appointment"),
    "rubdown": ("massage", "wellness", "studio", "appointment", "cancellation"),
    "checkout": ("late checkout", "11:00", "2:00", "availability"),
    "morning": ("late checkout", "11:00", "2:00", "availability", "rush"),
    "loud": ("quiet hours", "complaints", "noise", "neighbors"),
    "tesla": ("EV", "charging", "garage", "kilowatt", "overnight"),
    "clean": ("housekeeping", "towels", "privacy mode", "tablet", "nights"),
}
```
I wrote the dataclass as frozen because I wanted the configuration object to behave like a value I pass around without accidental mutation during a late-night edit. In my opinion, small immutability choices reduce self-inflicted bugs in solo projects too. The synonym map is deliberately limited. I did not try to learn it from data in this repository because I wanted an honest baseline I could explain without pointing at a training pipeline I do not own.
Healing orchestration: measure, expand, merge
The orchestration function encodes the query, compares against the floor, optionally expands the query text when informal keywords appear, and finally merges the top passages if the system still cannot climb above the floor. The merge path constructs a synthetic chunk identifier so the log shows that the evidence is bundled rather than a single canonical policy paragraph.
```python
def self_healing_retrieve(index, raw_query, cfg):
    query_used = raw_query
    qvec = encode_query(raw_query, cfg)
    ranked = index.top_k(qvec, k=5)
    best_chunk, best_score = ranked[0]
    if best_score >= cfg.similarity_floor:
        return HealingResult(
            query_used=query_used,
            best_chunk=best_chunk,
            best_score=best_score,
            healing_action="none",
            detail="Retrieval above similarity floor.",
        )
    expanded = _expand_query(raw_query)
    if expanded is not None:
        evec = encode_query(expanded, cfg)
        ranked_e = index.top_k(evec, k=5)
        ec, es = ranked_e[0]
        if es > best_score:
            best_chunk, best_score, query_used = ec, es, expanded
        if best_score >= cfg.similarity_floor:
            return HealingResult(
                query_used=query_used,
                best_chunk=best_chunk,
                best_score=best_score,
                healing_action="synonym_expand",
                detail="Synonym expansion recovered a confident match.",
            )
        ranked_for_merge = ranked_e
    else:
        ranked_for_merge = ranked
    merged_chunk, merged_score = _merge_context(ranked_for_merge)
    if merged_score >= best_score:
        return HealingResult(
            query_used=query_used,
            best_chunk=merged_chunk,
            best_score=merged_score,
            healing_action="context_merge",
            detail="Merged top passages after low single-chunk confidence.",
        )
    return HealingResult(
        query_used=query_used,
        best_chunk=best_chunk,
        best_score=best_score,
        healing_action="staleness_flag",
        detail="Healing did not lift score above prior best; flagged for offline refresh.",
    )
```
When I read this function cold, I look for three things: whether I accidentally reuse the wrong ranked list after expansion, whether the merge path can invent a score that looks comparable to cosine similarity, and whether the failure mode still returns something inspectable. The merge score is an average of the top similarities. That is not a probability. It is a crude signal so the PoC can prefer a wider bundle over a single weak chunk.
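To make that "average, not a probability" point concrete, here is a plausible shape for `_merge_context`. The `PolicyChunk` fields and the `merged:` identifier prefix are my illustrative assumptions, not the repository's exact code:

```python
from dataclasses import dataclass


@dataclass
class PolicyChunk:
    chunk_id: str
    text: str


def _merge_context(
    ranked: list[tuple[PolicyChunk, float]], top_n: int = 3
) -> tuple[PolicyChunk, float]:
    """Bundle the top passages into one synthetic chunk.

    The merged score is the plain average of the top similarities: a crude
    preference signal for a wider bundle, not a calibrated probability.
    """
    picked = ranked[:top_n]
    # Synthetic identifier so the log shows the evidence is bundled.
    merged_id = "merged:" + "+".join(c.chunk_id for c, _ in picked)
    merged_text = "\n---\n".join(c.text for c, _ in picked)
    merged_score = sum(score for _, score in picked) / len(picked)
    return PolicyChunk(merged_id, merged_text), merged_score


chunks = [
    (PolicyChunk("spa-01", "Spa appointments require 24h notice."), 0.40),
    (PolicyChunk("spa-02", "Cancellations inside 24h incur a fee."), 0.30),
]
merged, score = _merge_context(chunks)
```

Note that averaging two mediocre similarities can produce a number that looks respectable next to a single weak score, which is exactly why the article keeps insisting the merge score is a compass needle rather than a verdict.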
Entry point: batch questions and a chart
The main module loads chunks, embeds them once, runs the first query with a detailed table, then iterates a batch list and records healing actions for plotting.
```python
def main() -> None:
    console = Console(width=118)
    cfg = RetrievalConfig()
    chunks = load_chunks(ROOT)
    texts = [c.text for c in chunks]
    matrix = encode_texts(texts, cfg)  # embed every chunk once at startup
    index = ChunkIndex(chunks, matrix)

    # First query gets a detailed Rich table.
    first = _demo_queries()[0]
    res0 = self_healing_retrieve(index, first[2], cfg)
    print_single_result(console, res0)

    # Batch pass: collect healing actions for the summary table and chart.
    rows: list[tuple[str, str, float, str]] = []
    actions: list[str] = []
    for qid, label, text in _demo_queries():
        r = self_healing_retrieve(index, text, cfg)
        actions.append(r.healing_action)
        rows.append((qid, r.healing_action, r.best_score, label))
    print_batch_ascii(console, rows)

    out_png = ROOT / "output" / "healing_actions.png"
    plot_healing_actions(actions, out_png)
```
I structured the demo queries to span direct pool language, slang wellness language, vague checkout language, noisy neighbor language, EV parking language, and opaque housekeeping language. In my experience, that spread is enough to stress the synonym map without pretending the corpus is comprehensive.
What I rejected along the way
I considered a few alternatives before settling on MiniLM plus a manual synonym map for the first public cut. A cross-encoder reranker would likely improve ordering on ambiguous pairs, but it would also double the inference story and tempt me to hide mistakes behind a second model without a clean measurement layer. I decided that demonstrating a two-stage neural stack was not the point of this repository. The point was to show a transparent loop.
I also thought about BM25 as a lexical backstop. It is a strong baseline for short documents and behaves well when the vocabulary overlap is explicit. On a ten-chunk corpus, though, the difference between BM25 and TF-IDF style signals is often swamped by the corpus size itself. I stayed with dense embeddings because the guest language problem I care about is often semantic drift rather than spelling. I can imagine a hybrid later: dense retrieval for the first pass, BM25 for dispute resolution when two chunks tie.
Vector store and why I kept it in memory
The ChunkIndex class is intentionally boring. It stores a matrix of normalized embeddings and a parallel list of PolicyChunk objects. Search is a matrix-vector product followed by sorting. I did not use an approximate nearest neighbor index because the row count is ten. Bringing in HNSW or IVF would be theater. I would rather write code that a reader can grep in a single file than pretend this PoC needs a billion-scale index.
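A minimal sketch of that boring index, under the assumption that rows are already L2-normalized so the matrix-vector product is cosine similarity (the string chunks stand in for the repository's `PolicyChunk` objects):

```python
import numpy as np


class ChunkIndex:
    """In-memory index: a row-normalized matrix plus parallel chunk metadata."""

    def __init__(self, chunks: list[str], matrix: np.ndarray) -> None:
        assert len(chunks) == matrix.shape[0]
        self.chunks = chunks
        self.matrix = matrix

    def top_k(self, qvec: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        # Exact search: one matrix-vector product, then a full sort.
        # At ten rows, an approximate nearest neighbor index is pure overhead.
        scores = self.matrix @ qvec
        order = np.argsort(scores)[::-1][:k]
        return [(self.chunks[i], float(scores[i])) for i in order]


# Toy example with two unit-length 2-D "embeddings".
index = ChunkIndex(["pool-policy", "spa-policy"], np.array([[1.0, 0.0], [0.0, 1.0]]))
hits = index.top_k(np.array([0.0, 1.0]), k=1)
```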
Staleness as a narrative device
Each synthetic chunk carries a staleness_days integer. I use it as a narrative device in the detail string when a chunk crosses a warning threshold. I am not running a real document management system. I am simulating the feeling of operations where a PDF might have been updated last quarter while the embedding still reflects old text. If I ever wire this to a real ingestion pipeline, staleness should come from a database, not a JSON field I edited by hand.
Theory in plain language: what cosine similarity is doing here
When I say cosine similarity, I mean the dot product between two unit vectors. The sentence-transformers library can emit normalized embeddings, which turns cosine similarity into a single dot product without a separate magnitude step. That is convenient, but it also means I am trusting the encoder to place paraphrases near each other in angular space. On small corpora, the geometry can be surprisingly sharp. Two chunks that look similar to me might sit far apart because the model latched onto different function words.
I spent time thinking about what a score of 0.45 means versus 0.65. In a calibrated probabilistic system, those numbers would come with a story about calibration. Here, they are relative ranks with a threshold I set manually. I want to be explicit about that because it is easy to reify cosine scores as confidence when they are not. In my PoC, the score is a compass needle, not a verdict.
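The normalization point is easy to verify numerically: cosine similarity computed with explicit magnitudes agrees with a bare dot product once both vectors are unit length, which is why the retrieval layer can skip the division step entirely.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity with the explicit magnitude division."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalizing first turns cosine similarity into a bare dot product,
# which is what normalized sentence-transformers embeddings give you.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = float(a_unit @ b_unit)
```

Nothing in this arithmetic makes the number a probability, of course; it only guarantees the two formulations rank chunks identically.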
Edge cases that kept me honest
- If a guest uses a phrase that matches multiple synonym keys, the expansion step concatenates extra tokens. That can help or hurt. More tokens add noise. I mitigated this by keeping the synonym tuples short and topic-aligned.
- If the first retrieval is already above the floor, I do not attempt healing. That is intentional. I did not want a system that always second-guesses a strong match.
- If expansion fails and merge still looks weak, I fall back to staleness_flag. That label is intentionally unsatisfying. It is a reminder that some queries need a human or a richer corpus, not another heuristic.
Personal workflow notes from building solo
I kept a paper notebook beside the keyboard while I wrote the synthetic questions. That sounds quaint, but it slowed me down in a useful way. When I type questions directly into code, I optimize for short strings. When I write them on paper, I leave in awkwardness. Awkwardness is the point. I also tracked my own confusion: if I could not remember why a threshold existed a week later, I renamed a variable or added a comment. I am not claiming perfect documentation. I am claiming that solo work still benefits from a future reader, and that future reader is often me.
Let's Setup
Step-by-step details can be found in the repository README. At a high level, I create a virtual environment inside the project directory, install requirements, and run python main.py. The first execution downloads the sentence-transformers weights, which is the longest step. I prefer keeping the virtual environment local to the project so the PoC stays self-contained when I archive it months later.
Deeper code walkthrough: embedder and index
The embedder module memoizes the SentenceTransformer model so repeated runs do not reload weights. I encode all chunk texts once at startup, then reuse the matrix for every query. That is standard batching discipline, but it matters when I iterate on questions because the expensive work stays amortized.
The vector store computes similarity as a dot product between the query vector and each row of the matrix. I take the top five for debugging even though decisions only need the top one. I do that because I want to inspect near-misses when a score looks wrong. In my experience, the second-best chunk often explains why the first-best chunk is misleading.
Roadmap I would pursue if this stayed a hobby
- Add a small evaluation harness with paraphrased questions and a simple precision-at-one metric.
- Swap the manual synonym map for a learned sparse expansion that I can still audit, or for a curated ontology from a domain I own.
- Introduce an explicit human-approval flag in the output for any evidence bundle that includes merged chunks.
- Explore a lightweight reranker only after the measurement harness exists, because I refuse to stack models without a baseline.
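The first roadmap item is small enough to sketch now. Precision-at-one over labeled question-chunk pairs needs nothing beyond a retriever callable; the keyword-lookup retriever below is a toy stand-in for the embedding stack, and all names here are hypothetical:

```python
from collections.abc import Callable


def precision_at_one(
    pairs: list[tuple[str, str]], retrieve: Callable[[str], str]
) -> float:
    """Fraction of questions whose top-ranked chunk id matches the label."""
    if not pairs:
        return 0.0
    hits = sum(1 for question, expected in pairs if retrieve(question) == expected)
    return hits / len(pairs)


# Toy retriever keyed on a single word, standing in for dense retrieval.
LOOKUP = {"pool": "pool-01", "spa": "spa-01"}


def toy_retrieve(question: str) -> str:
    for word, chunk_id in LOOKUP.items():
        if word in question.lower():
            return chunk_id
    return "unknown"


score = precision_at_one(
    [
        ("Is the pool open late?", "pool-01"),
        ("Spa hours tomorrow?", "spa-01"),
        ("EV charging overnight?", "ev-01"),
    ],
    toy_retrieve,
)
```

Sweeping the similarity floor would then mean re-running this metric per threshold value and recording how often expansion or merge changes the top chunk.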
Reflections on reliability and guest trust
Reliability is not only a technical score. It is also the feeling a staff member gets when they read the evidence. If the evidence is verbose, contradictory, or obviously stitched, trust drops even when a cosine score is high. I thought about that while designing the merge path. Bundling top passages is a blunt instrument. It increases recall at the cost of readability. In a production setting, I would want a summarization step that cites chunk identifiers, and I would want those identifiers to map back to a source document version. None of that exists here. I am describing what I would do next, not what I shipped.
Let's Run
When the script finishes, I expect a Rich table for the first query, an ASCII batch summary, and a matplotlib file under output/healing_actions.png. I treat that chart as a sanity check. If every bar lands in one category, I suspect a bug or a threshold that is too aggressive.
I usually run the script twice in a row during development. The first run pays the model download cost if the cache is cold. The second run is the one I use to compare output after a code change, because it removes network noise from the picture. That habit saved me from chasing ghosts more than once when I thought my logic changed scores but the real difference was initialization time.
What I would measure if I turned this into a longer study
I am not running a formal benchmark in this repository. I want to be explicit about that gap because benchmarks are where retrieval claims go to become honest or fall apart. If I had another month of evenings, I would build a small labeled set of question-and-chunk pairs derived from the same synthetic corpus, then sweep the similarity floor and record how often expansion or merge changes the top chunk. I would also measure latency on a cold start versus a warm start, because the sentence-transformers download is the kind of friction that changes whether a demo feels credible in a conference room with spotty Wi-Fi.
I would also track how often merged bundles confuse a human reader. That is a qualitative metric, but it matters. A merge that improves cosine similarity but produces a wall of text is not a win if the goal is staff confidence. In my opinion, human readability should be a first-class metric alongside rank metrics, even if it is harder to automate.
Why I did not ship a full chatbot wrapper
A full chatbot would need session management, safety filters, and a clear escalation path. Those layers are important, but they would dilute the retrieval story I wanted to tell. I kept the surface area small on purpose. The CLI prints evidence. That is enough for me to judge whether the retrieval layer is behaving, and it is enough for a reader to fork the repository without inheriting a web stack I do not want to maintain as a solo author.
Dependencies, downloads, and the social contract of open weights
I rely on publicly available weights. That choice carries a social contract: read the license, respect attribution, and do not pretend the model is neutral truth. I also accept the reality that first-time downloads can fail for reasons outside my code. I mention this because newcomers sometimes blame their own competence when the network hiccups during a Hugging Face download. If that happened to you while reproducing my PoC, retrying with a stable connection usually fixes it. If it persists, mirror the weights locally and point the configuration at your mirror. I did not bake a mirror into the repository because I did not want to privilege a single hosting strategy.
Narrative distance from production
I keep repeating that this is experimental because I want the distance to be obvious. Production systems have change control, incident response, and accountability chains I am not simulating. When I say staleness_flag, I am not claiming an operational incident ticket exists. I am labeling a branch in my code. That distinction matters if someone reads this article quickly and assumes they can paste the repository into a live environment without additional work.
Closing Thoughts
What I learned about my own habits
I noticed that I reached for matplotlib faster than I reached for unit tests in the first week. That is not a brag. It is a confession. Charts feel like progress. Tests feel like discipline. In a longer project I would add tests around the synonym expansion function and the merge path because those are the places where silent bugs hide. For this PoC, I relied on manual inspection and repeated runs. I am documenting that choice because I want readers who clone the repository to know where rigor ends and storytelling begins in my own process.
A word on naming
I named the repository GuestResilience-HotelContext-AI because I wanted the words to sound operational without sounding like a product SKU. Names matter when you revisit a folder six months later. I have abandoned enough cleverly named experiments to appreciate boring clarity.
I started this experiment because I wanted a personal answer to a simple question: what does resilience mean when the model is small, the corpus is synthetic, and the user language is sloppy? My answer, for now, is that resilience looks like measurement first, bounded repairs second, and honest failure labels third. I do not think that answer is universal. I think it is a reasonable discipline for a PoC that might otherwise collapse into storytelling.
If I revisit the project, the first upgrade I would consider is a principled evaluation split: hold out chunks, paraphrase questions, and quantify how often expansion helps versus hurts. The second upgrade would be a real staleness pipeline, not a numeric field I typed by hand. The third upgrade would be an explicit separation between guest-facing summarization and evidence retrieval, even if both use models, because commingling them erodes auditability.
If nothing else, I hope this write-up convinces you that resilience can be practiced as a discipline even when the dataset is synthetic. The point is not to win a leaderboard. The point is to build a habit of measuring, repairing, and labeling failure without embarrassment.
I also want to leave you with a caution I apply to my own demos. Hospitality language intersects with accessibility, safety, and fairness. A retrieval stack that looks clever on a developer laptop can still be wrong in ways that matter. I wrote this as an experimental article precisely because I want room to be humble about those limits.
Tags: python, rag, machinelearning, hospitality
Thank you for reading this far. I know it is a long piece. I wrote it at this length because I wanted the reasoning trail to be inspectable, not because I enjoy typing for its own sake.
Disclaimer
The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.