<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: chuanman2707</title>
    <description>The latest articles on DEV Community by chuanman2707 (@chuanman2707).</description>
    <link>https://dev.to/chuanman2707</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F157955%2F8bf48c22-1c7c-4b55-a279-f1f782c1022d.png</url>
      <title>DEV Community: chuanman2707</title>
      <link>https://dev.to/chuanman2707</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chuanman2707"/>
    <language>en</language>
    <item>
      <title>Building a Local-First Hotel Receptionist with Gemma 4, GGUF, and llama.cpp</title>
      <dc:creator>chuanman2707</dc:creator>
      <pubDate>Fri, 15 May 2026 08:58:22 +0000</pubDate>
      <link>https://dev.to/chuanman2707/building-a-local-first-hotel-receptionist-with-gemma-4-gguf-and-llamacpp-51a4</link>
      <guid>https://dev.to/chuanman2707/building-a-local-first-hotel-receptionist-with-gemma-4-gguf-and-llamacpp-51a4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Why I Built This&lt;/h2&gt;

&lt;p&gt;I have been building CapyInn, a hotel management project for small hotels, guesthouses, and homestays in Vietnam.&lt;/p&gt;

&lt;p&gt;The original CapyInn project started before this challenge, but the new work I focused on for this Gemma 4 challenge was the AI receptionist layer: a local-first front-desk assistant powered by Gemma 4, converted to GGUF, and served through llama.cpp.&lt;/p&gt;

&lt;p&gt;The goal was not to build a general chatbot.&lt;/p&gt;

&lt;p&gt;The goal was to build a bounded receptionist copilot that can help hotel staff answer common guest questions, while safely deferring anything it cannot verify.&lt;/p&gt;

&lt;p&gt;For small hospitality businesses, this matters because late-night guest messages, check-in questions, room details, and policy questions often arrive when staff are busy or asleep. But at the same time, the assistant should not pretend to confirm payments, approve fake documents, or access private hotel systems.&lt;/p&gt;

&lt;h2&gt;Demo&lt;/h2&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/uYGbkv2HfHQ"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;CapyInn Receptionist is a local AI front-desk copilot for small hotels in Vietnam.&lt;/p&gt;

&lt;p&gt;It can help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;answering room and check-in questions&lt;/li&gt;
&lt;li&gt;drafting replies for late-night guest messages&lt;/li&gt;
&lt;li&gt;asking follow-up questions when booking information is incomplete&lt;/li&gt;
&lt;li&gt;explaining basic hotel policies&lt;/li&gt;
&lt;li&gt;refusing or deferring sensitive requests to hotel staff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important behavior is the boundary.&lt;/p&gt;

&lt;p&gt;If the assistant cannot verify something, it should not make it up. For example, it should not confirm a payment, accept suspicious guest documents, or expose system access. It should hand those cases back to a human.&lt;/p&gt;

&lt;h2&gt;How I Used Gemma 4&lt;/h2&gt;

&lt;p&gt;I fine-tuned and packaged a Gemma 4-based receptionist model for this hospitality workflow, then converted it into GGUF so it could run locally with llama.cpp.&lt;/p&gt;

&lt;p&gt;The local model file I used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capyinn-gemma-4-Q5_K_M.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model runs locally on a Mac mini with Apple M4, 10-core CPU, 16 GB unified memory, using llama.cpp with Metal/BLAS.&lt;/p&gt;
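&lt;p&gt;As a minimal sketch of what this looks like in code, here is one way to query a local GGUF file like the one above, assuming the llama-cpp-python bindings; the system prompt and function names are illustrative, not CapyInn's actual implementation:&lt;/p&gt;

```python
# Sketch: querying the quantized receptionist model locally.
# Assumes the llama-cpp-python bindings (pip install llama-cpp-python);
# the system prompt and function names are illustrative only.

SYSTEM_PROMPT = (
    "You are the front-desk assistant for a small hotel. "
    "Answer room, check-in, and policy questions briefly. "
    "If you cannot verify something, do not confirm it; defer to staff."
)

def ask_receptionist(question, model_path="capyinn-gemma-4-Q5_K_M.gguf"):
    from llama_cpp import Llama  # imported lazily so the sketch loads without the model file
    llm = Llama(
        model_path=model_path,
        n_ctx=4096,        # 4K context kept RAM at about 6 GiB in my runs
        n_gpu_layers=-1,   # offload all layers to Metal on Apple silicon
        verbose=False,
    )
    reply = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        max_tokens=256,
    )
    return reply["choices"][0]["message"]["content"]
```

&lt;p&gt;The important design choice is that the refusal behavior is stated up front in the system prompt, so every reply is generated inside the boundary rather than filtered afterwards.&lt;/p&gt;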

&lt;p&gt;In my latest conservative benchmark run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generation speed: about 29 tokens/second&lt;/li&gt;
&lt;li&gt;prompt/prefill speed: about 511 tokens/second&lt;/li&gt;
&lt;li&gt;cold CLI startup to first token: about 2.1 seconds&lt;/li&gt;
&lt;li&gt;short 64-token capped response from cold startup: about 3.7 seconds&lt;/li&gt;
&lt;li&gt;RAM allocation: about 6.0 GiB at 4K context&lt;/li&gt;
&lt;li&gt;RAM allocation: about 7.0 GiB at 128K context&lt;/li&gt;
&lt;li&gt;GGUF file size: 3.35 GiB&lt;/li&gt;
&lt;li&gt;metadata context window: 131,072 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was good enough for a practical front-desk assistant on a small local machine.&lt;/p&gt;
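&lt;p&gt;A quick back-of-envelope check shows what those throughput numbers mean for a guest-facing reply. The message sizes here are illustrative, not measured:&lt;/p&gt;

```python
# Rough latency estimate from the benchmark numbers above:
# 511 tok/s prefill, 29 tok/s generation (my measurements);
# the example message sizes are illustrative.

PREFILL_TPS = 511.0
GENERATION_TPS = 29.0

def estimated_latency_s(prompt_tokens, reply_tokens):
    return prompt_tokens / PREFILL_TPS + reply_tokens / GENERATION_TPS

# A typical exchange: ~400 prompt tokens (system prompt plus question)
# and a ~150-token reply.
print(round(estimated_latency_s(400, 150), 1))  # roughly 6 seconds
```

&lt;p&gt;Around six seconds for a full reply is well within what a guest expects from a human receptionist answering chat messages.&lt;/p&gt;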

&lt;h2&gt;Why Local AI Matters Here&lt;/h2&gt;

&lt;p&gt;A hotel receptionist assistant handles information that can be sensitive: guest names, booking details, arrival times, special requests, and sometimes payment-related questions.&lt;/p&gt;

&lt;p&gt;For a small hotel, sending every guest message to a remote API is not always ideal: it adds per-request cost and moves guest data off-site.&lt;/p&gt;

&lt;p&gt;A local Gemma 4 setup gives a few practical advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lower ongoing cost&lt;/li&gt;
&lt;li&gt;better privacy posture&lt;/li&gt;
&lt;li&gt;usable latency on consumer hardware&lt;/li&gt;
&lt;li&gt;no dependency on cloud availability for basic replies&lt;/li&gt;
&lt;li&gt;easier deployment for small businesses that already have an office computer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is that the assistant must be carefully scoped. Local does not automatically mean safe. The model still needs clear task boundaries.&lt;/p&gt;

&lt;h2&gt;The Safety Rule I Used&lt;/h2&gt;

&lt;p&gt;The main rule is simple:&lt;/p&gt;

&lt;p&gt;If the assistant cannot verify it, it should not confirm it.&lt;/p&gt;

&lt;p&gt;That means the assistant can draft helpful replies, but it should defer sensitive actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payment confirmation&lt;/li&gt;
&lt;li&gt;suspicious guest documents&lt;/li&gt;
&lt;li&gt;account or system access&lt;/li&gt;
&lt;li&gt;policy exceptions&lt;/li&gt;
&lt;li&gt;anything requiring staff approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made the demo much more realistic. A hotel AI assistant should be helpful, but it should also know when to stop.&lt;/p&gt;
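&lt;p&gt;One way to sketch the "defer what you cannot verify" rule is a thin pre-filter in front of the model. The categories mirror the list above; the keywords and function names are illustrative, not the project's actual implementation:&lt;/p&gt;

```python
# Sketch of the deferral rule as a keyword pre-filter.
# Categories mirror the article's list; keywords are illustrative only.

DEFER_KEYWORDS = {
    "payment": ["payment", "paid", "refund", "deposit", "transfer"],
    "documents": ["passport", "id card", "visa", "document"],
    "system access": ["password", "login", "admin", "account access"],
    "policy exception": ["waive", "make an exception"],
}

def should_defer(message):
    """Return the matched category if the message needs staff, else None."""
    text = message.lower()
    for category, keywords in DEFER_KEYWORDS.items():
        if any(k in text for k in keywords):
            return category
    return None

print(should_defer("Can you confirm my payment went through?"))  # payment
print(should_defer("What time is check-in?"))  # None
```

&lt;p&gt;A keyword filter is obviously crude on its own; in practice it works as a belt-and-suspenders layer on top of the system-prompt boundary, guaranteeing that the riskiest categories are routed to staff even if the model's own judgment slips.&lt;/p&gt;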

&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;The biggest lesson was that model capability is only one part of the product.&lt;/p&gt;

&lt;p&gt;The harder part is designing the workflow around the model.&lt;/p&gt;

&lt;p&gt;For this use case, I cared less about making the assistant sound impressive and more about making it useful, bounded, and honest.&lt;/p&gt;

&lt;p&gt;Gemma 4 worked well for this because it was capable enough for conversational front-desk tasks, while still small enough to run locally after quantization.&lt;/p&gt;

&lt;p&gt;The final result is not a replacement for hotel staff. It is a copilot that can reduce repetitive work and help small hotels respond faster.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;p&gt;GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chuanman2707/CapyInn" rel="noopener noreferrer"&gt;https://github.com/chuanman2707/CapyInn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/chuanman2707/capyinn-gemma-4-e2b-it-q5-k-m-gguf" rel="noopener noreferrer"&gt;https://huggingface.co/chuanman2707/capyinn-gemma-4-e2b-it-q5-k-m-gguf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/uYGbkv2HfHQ" rel="noopener noreferrer"&gt;https://youtu.be/uYGbkv2HfHQ&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
