AI Travel Assistant Powered by Gemma 4, with Streaming, Image Input, and Visual Recommendation Cards

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Planning a trip used to mean bouncing between five browser tabs — one for flights, one for hotels, one for itineraries, one for Reddit threads, and one you forgot you opened. I wanted to collapse that into a single conversation.

Gemma Travel Assistant is an AI-powered chat app that helps you plan trips from scratch. Tell it your budget, your vibe, your dates. Ask follow-up questions. Upload a photo of somewhere you saw on Instagram and ask "where is this, and what should I do there?" It remembers everything you said earlier in the conversation and uses it to give you better answers.

What makes it feel different from a plain chatbot:

- It doesn't just write paragraphs. When Gemma recommends hotels or destinations, the app parses those recommendations out of the response and renders them as visual cards showing name, location, type badge (hotel / destination / restaurant), star rating, and price range. You can scan five options in three seconds instead of reading five bullet points.
- Responses stream token by token. You start reading the answer while Gemma is still writing it. For a full 5-day itinerary that can run past 600 words, this makes the experience feel instant instead of frozen.
- It understands images natively. Drop in a photo (a landscape, a hotel lobby, a plate of food) and the model uses it as context. No extra vision pipeline, no OCR. Gemma 4 handles it directly.

Demo

Example conversation:

You: Plan a 5-day trip to Kyoto in October, budget around $1500, I love temples and local food

Gemma: Here's a day-by-day itinerary for Kyoto in October — peak foliage season, so I've planned around the best viewing spots...
(streams in, then suggestion cards appear below for ryokans and restaurants)

You: (uploads a photo of a bamboo forest)

Gemma: That's Arashiyama Bamboo Grove in western Kyoto. It's already on day 3 of your itinerary — here are the best times to visit to beat the crowds...

GitHub: https://github.com/mushahidmehdi/gemma-travel-assistant

Code

Stack:
| Layer | Choice |
|---|---|
| Framework | Next.js 16 (App Router) |
| Model | Gemma 4 31B Dense via OpenRouter |
| Styling | Tailwind CSS |
| Markdown | ReactMarkdown |
| Icons | Lucide React |

Project structure:

src/
├── app/
│   ├── api/chat/route.ts   # Streaming SSE proxy → OpenRouter
│   ├── layout.tsx
│   └── page.tsx            # Centered card layout
└── components/
    ├── ChatInterface.tsx   # Input, image upload, message list
    ├── ChatMessage.tsx     # Bubble renderer + suggestion parser
    └── SuggestionCard.tsx  # Hotel / destination / restaurant cards

How I Used Gemma 4

Choosing the model

I went with Gemma 4 31B Dense (google/gemma-4-31b-it). Here's why that specific model, not the others:

The E2B / E4B models are designed for edge and mobile — brilliant for offline use, but I needed server-grade reasoning quality for multi-day itineraries with budget constraints, visa tips, and local context. A 2B model can hallucinate confidently about things it doesn't know well.

The 26B MoE model is optimized for throughput. For a travel assistant where a single user sends a message and waits for the reply, throughput wasn't the bottleneck. Quality and coherence over a long conversation were.

The 31B Dense hits the right balance: strong enough to produce well-structured, accurate travel advice, consistent enough to reliably follow formatting instructions (more on that below), and available on OpenRouter's free tier so anyone can clone the repo and run it without a credit card.

The 128K context window was the other deciding factor. Planning a real trip is a long conversation. By the time you've discussed your budget, chosen a region, rejected two hotel options, added a day trip, and asked about visa requirements, you've accumulated thousands of tokens of context. With a smaller window, the app has to truncate history, and the earliest constraints are the first to go. With 128K, the whole planning session fits in the prompt on every turn.
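To get a feel for how far a conversation is from that limit, here is a rough budget check. It is only a sketch: the 4-characters-per-token ratio is a common rule of thumb for English text, not Gemma's real tokenizer, and `ChatTurn`, `estimateTokens`, `fitsInContext`, and the reply reserve are hypothetical names and values that are not in the repo.

```typescript
// Rough context-budget check. The chars-per-token ratio is a heuristic
// assumption, not the model's actual tokenizer.
interface ChatTurn {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

const CONTEXT_WINDOW = 128000; // tokens, the figure quoted above
const CHARS_PER_TOKEN = 4;     // assumed average for English text

function estimateTokens(turns: ChatTurn[]): number {
  const chars = turns.reduce((sum, turn) => sum + turn.content.length, 0);
  return Math.ceil(chars / CHARS_PER_TOKEN);
}

// Leaves headroom for the reply itself (replyReserve is a hypothetical default).
function fitsInContext(turns: ChatTurn[], replyReserve = 2000): boolean {
  return estimateTokens(turns) + replyReserve <= CONTEXT_WINDOW;
}
```

By this estimate, a long planning session of a few hundred short messages stays well under the limit, which is why the app can send the full history every time.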

Streaming the response

The API route doesn't buffer the full reply. It reads OpenRouter's SSE stream, extracts each content delta, and forwards it to the browser as plain text the moment it arrives:

// src/app/api/chat/route.ts
const encoder = new TextEncoder();

const stream = new ReadableStream({
  async start(controller) {
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = ''; // holds a partial SSE line split across network chunks

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true keeps multi-byte characters intact across chunk boundaries
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // the last element may be an incomplete line

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue; // skip comments and blank lines
        const data = line.slice(6).trim();
        if (data === '[DONE]') { controller.close(); return; }
        try {
          const parsed = JSON.parse(data);
          const content = parsed.choices?.[0]?.delta?.content;
          if (content) controller.enqueue(encoder.encode(content));
        } catch { /* skip malformed chunks */ }
      }
    }
    controller.close();
  },
});

return new Response(stream, {
  headers: { 'Content-Type': 'text/plain; charset=utf-8' },
});
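The snippet above starts from a `response` that has already been fetched from OpenRouter. A minimal sketch of how that upstream request could be built follows. The endpoint is OpenRouter's chat completions URL and the model ID is the one named earlier. `buildUpstreamRequest`, `UpstreamMessage`, and the `OPENROUTER_API_KEY` variable name are assumptions for illustration, not code from the repo.

```typescript
// Hypothetical helper: builds the upstream request that yields `response`.
interface UpstreamMessage {
  role: 'system' | 'user' | 'assistant';
  content: unknown; // a string, or an array of content blocks for image turns
}

const OPENROUTER_URL = 'https://openrouter.ai/api/v1/chat/completions';
const MODEL_ID = 'google/gemma-4-31b-it';

function buildUpstreamRequest(messages: UpstreamMessage[], apiKey: string) {
  return {
    url: OPENROUTER_URL,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      // stream: true asks for Server-Sent Events instead of one JSON body
      body: JSON.stringify({ model: MODEL_ID, messages, stream: true }),
    },
  };
}

// Usage inside the route handler (OPENROUTER_API_KEY is an assumed name):
// const { url, init } = buildUpstreamRequest(messages, process.env.OPENROUTER_API_KEY!);
// const response = await fetch(url, init);
```

Keeping the request construction in a pure function means it can be unit-tested without touching the network, and the API key never leaves the server.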

On the client, ChatInterface reads the stream chunk by chunk and appends to the last message in state, so React re-renders progressively as tokens arrive.
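The state update at the heart of that loop can be sketched as a pure function. This is an illustration of the idea rather than the component's actual code: `UiMessage`, `appendToLastMessage`, and the `setMessages` setter mentioned in the comment are hypothetical names.

```typescript
interface UiMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Returns a new array with `chunk` appended to the final assistant message.
// Neither the input array nor its objects are mutated, so React sees the change.
function appendToLastMessage(messages: UiMessage[], chunk: string): UiMessage[] {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') {
    // No assistant bubble yet: start one with the first chunk.
    return [...messages, { role: 'assistant', content: chunk }];
  }
  return [...messages.slice(0, -1), { ...last, content: last.content + chunk }];
}

// Inside the read loop, with a hypothetical setMessages state setter:
// setMessages(prev => appendToLastMessage(prev, decoder.decode(value, { stream: true })));
```

Using the functional form of the setter matters here: chunks can arrive faster than React commits renders, and reading from `prev` avoids appending to a stale copy of the message list.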

Structured output via prompting

I didn't use a formal structured-output API. Instead, the system prompt tells Gemma to append a fenced block tagged suggestions (opened by three backticks followed by the word suggestions, closed by three backticks) at the end of any response that contains specific recommendations:

When suggesting places, format your hotel/destination/restaurant recommendations
as a JSON block at the end of your response:

suggestions
[{
"name": "Nishiyama Onsen Keiunkan",
"location": "Yamanashi, Japan",
"type": "hotel",
"rating": 4.9,
"price": "$$$",
"description": "The world's oldest hotel, operating since 705 AD..."
}]


ChatMessage then does two things: strips that block from the visible text (so it doesn't appear as raw JSON in the bubble), and passes the parsed array to SuggestionCard components:

function parseSuggestions(content: string) {
  const match = content.match(/```suggestions\n([\s\S]*?)```/);
  if (!match) return { text: content, suggestions: [] };

  const text = content.replace(/```suggestions\n[\s\S]*?```/, '').trim();
  try {
    return { text, suggestions: JSON.parse(match[1]) };
  } catch {
    return { text: content, suggestions: [] }; // graceful fallback
  }
}
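One case worth thinking about is the middle of a stream, when the opening fence has arrived but the closing one has not, so the regex cannot match yet. A possible way to keep half-written JSON out of the bubble is sketched below. `visibleWhileStreaming` is a hypothetical helper, not part of the repo, and it assumes the block always comes last in the reply, as the system prompt requests.

```typescript
const FENCE = '`'.repeat(3);          // three backticks, built here to keep this example fence-safe
const OPENER = FENCE + 'suggestions';

// Returns the text that is safe to show while the reply is still streaming.
function visibleWhileStreaming(partial: string): string {
  const start = partial.indexOf(OPENER);
  if (start === -1) return partial;          // no block started yet
  const close = partial.indexOf(FENCE, start + OPENER.length);
  if (close !== -1) return partial;          // block complete: let parseSuggestions handle it
  return partial.slice(0, start).trimEnd();  // block still open: hide the raw JSON
}
```

A few characters of the opener can still flash on screen before the full tag has arrived. For this sketch that is treated as acceptable, but a stricter version could also hold back a trailing run of backticks.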

If Gemma omits the block — for a conversational reply like "Great, let's add a day trip!" — the component falls through cleanly and just shows the text bubble. No crashes, no empty card rows.
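Because the JSON comes from a prompt instruction rather than a schema-enforced API, it can also parse successfully and still have the wrong shape. A small runtime check before rendering guards against that. The field names below come from the prompt example. The `Suggestion` type, the helper names, and the 0 to 5 rating range are assumptions made for this sketch.

```typescript
type SuggestionType = 'hotel' | 'destination' | 'restaurant';

interface Suggestion {
  name: string;
  location: string;
  type: SuggestionType;
  rating: number;
  price: string;
  description: string;
}

const SUGGESTION_TYPES: readonly string[] = ['hotel', 'destination', 'restaurant'];

// Type guard: true only when every field exists with the expected type.
function isSuggestion(value: unknown): value is Suggestion {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === 'string' &&
    typeof v.location === 'string' &&
    typeof v.type === 'string' && SUGGESTION_TYPES.includes(v.type) &&
    typeof v.rating === 'number' && v.rating >= 0 && v.rating <= 5 && // assumed star scale
    typeof v.price === 'string' &&
    typeof v.description === 'string'
  );
}

// Keeps well-formed entries and silently drops the rest.
function validSuggestions(parsed: unknown): Suggestion[] {
  return Array.isArray(parsed) ? parsed.filter(isSuggestion) : [];
}
```

Passing the output of `JSON.parse` through `validSuggestions` means a single malformed entry costs one card instead of breaking the whole row.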

Multimodal input

Image uploads are encoded as base64 data URLs and attached to the last user message as an image_url content block, the OpenAI-compatible message format that OpenRouter accepts for Gemma 4:

if (msg.role === 'user' && image && isLastMessage) {
  return {
    role: 'user',
    content: [
      { type: 'text', text: msg.content },
      { type: 'image_url', image_url: { url: image } }, // base64 data URL
    ],
  };
}

Gemma 4's native vision understands the image without any preprocessing on my end — no external OCR, no separate vision model call. The model sees both the image and the conversation history and responds in context.
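For reference, this is roughly what a base64 data URL is. In the browser, `FileReader.readAsDataURL` produces one directly from the uploaded file. The sketch below does the same by hand so the format is visible. `toDataUrl`, `canUpload`, and the 4 MB cap are hypothetical, introduced only for illustration.

```typescript
// Encodes raw bytes as a base64 data URL (btoa is global in browsers and in Node 16+).
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]); // one char per byte, as btoa expects
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Hypothetical size guard: base64 inflates the payload by about a third,
// so large photos are worth rejecting or resizing before encoding.
const MAX_IMAGE_BYTES = 4 * 1024 * 1024; // assumed limit, tune for your host

function canUpload(bytes: Uint8Array): boolean {
  return bytes.length > 0 && bytes.length <= MAX_IMAGE_BYTES;
}
```

The resulting string is what goes into the `url` field of the image_url block shown above.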

Building this made me appreciate how much the context window size and multimodal capability change what's actually possible in a single conversation. A travel assistant that forgets what you said three messages ago, or that can't look at a photo you found, is just a fancier search box. Gemma 4 31B makes it feel like talking to someone who's actually paying attention.