How I Built ReadMenuAI: Solving the "Poetic" Chinese Menu Problem with GenAI

Hi everyone, I'm DoubleZ! 👋

Have you ever walked into a Chinese restaurant, faced a menu full of beautiful but confusing characters, and had no idea what to order? Or worse, ordered a dish only to find it looks nothing like what you imagined?

This is a common pain point for many expats and travelers. Traditional translation apps often fail here because Chinese menus are a uniquely difficult data source: artistic fonts, handwritten text, inconsistent layouts, and "poetic" dish names that shouldn't be translated literally (e.g., 夫妻肺片, "Husband and Wife Lung Slices").

Simple translation isn't enough; users need visual context. That's why I built ReadMenuAI.


🌟 What is ReadMenuAI?

ReadMenuAI is an AI-powered tool that helps users "see" and understand Chinese menus. You simply upload a photo, and the AI transforms it into a digital, interactive experience.

  • ✅ OCR & Extraction: Detects dish names and prices accurately.
  • 🌍 Contextual Translation: Translates names while explaining ingredients and allergens.
  • 🖼️ AI Image Generation (The "Wow" Factor): Generates high-quality, representative photos for dishes that don't have pictures.
  • 💰 Travel Utilities: Real-time currency conversion and audio pronunciation for easy ordering.

๐Ÿ› ๏ธ The Technical Deep Dive: The AI Stack

The core challenge was turning unstructured, stylized visual data into structured, meaningful information. Here is how the AI pipeline works:

1. Advanced OCR with Qwen3-Vision

I used the latest Qwen3 (Tongyi Qianwen) multimodal models. Unlike standard OCR, these models are exceptional at:

  • Recognizing handwritten or highly stylized calligraphy.
  • Maintaining the spatial relationship between a dish name and its price across messy layouts.
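
To make this concrete, here is roughly what the OCR call looks like. It's a minimal sketch assuming DashScope's OpenAI-compatible endpoint and the official `openai` npm client; the model name and prompt are illustrative placeholders, not the exact ones in production:

```typescript
// Minimal sketch: ask a Qwen vision model to read a menu photo.
// Assumes DashScope's OpenAI-compatible endpoint; model/prompt are illustrative.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

export async function extractMenuText(imageUrl: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen-vl-max", // illustrative pick; use whichever Qwen vision model you have access to
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: imageUrl } },
          {
            type: "text",
            text: "List every dish name and its price on this menu as JSON: [{name, price}]. Keep dish names in the original Chinese.",
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "[]";
}
```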

2. Multimodal Parsing & Semantic Translation

The extracted text is fed into an LLM (Large Language Model) to go beyond literal translation. It identifies:

  • The "Real" Meaning: Explaining that "Ants Climbing a Tree" is actually glass noodles with minced pork.
  • Dietary Specs: Automatically tagging dishes as vegetarian, spicy, or containing common allergens.
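
Here is a simplified sketch of that second pass. The `ParsedDish` shape and the prompt are illustrative (the production schema is more detailed), but they show the idea: force the model to return structured JSON rather than free-form translation.

```typescript
// Minimal sketch: turn raw OCR text into structured, explained dishes.
// Field names below are illustrative stand-ins for the production schema.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

interface ParsedDish {
  originalName: string; // e.g. "蚂蚁上树"
  translation: string;  // e.g. "Ants Climbing a Tree"
  explanation: string;  // e.g. "glass noodles stir-fried with minced pork"
  dietary: string[];    // e.g. ["spicy"]
  allergens: string[];  // e.g. ["soy", "peanut"]
  price: string;
}

export async function parseDishes(rawMenuText: string): Promise<ParsedDish[]> {
  const response = await client.chat.completions.create({
    model: "qwen-max", // illustrative; any strong text model fits here
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          'You translate Chinese dish names for foreign diners. For each dish, explain what it actually is (never a literal translation) and tag vegetarian/spicy/allergens. Reply as JSON: {"dishes": [...]}',
      },
      { role: "user", content: rawMenuText },
    ],
  });
  return JSON.parse(response.choices[0].message.content ?? '{"dishes":[]}').dishes;
}
```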

3. Visual Context via Image Generation

For menus without photos, I integrated Tongyi Wanxiang 2 (Text-to-Image). This builds immediate trust. When a user sees a generated image of the dish, the "ordering anxiety" disappears.
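
Wanxiang's HTTP API is asynchronous: you submit a generation task and poll for the result. Here is a simplified sketch; the endpoint paths and model name follow DashScope's public docs but are trimmed down, so double-check them before copying:

```typescript
// Simplified sketch of DashScope's async text-to-image flow: submit, then poll.
const DASHSCOPE = "https://dashscope-intl.aliyuncs.com/api/v1";
const headers = {
  Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY}`,
  "Content-Type": "application/json",
};

export async function generateDishImage(dishDescription: string): Promise<string> {
  // 1. Submit the job; async mode returns a task id immediately.
  const submit = await fetch(`${DASHSCOPE}/services/aigc/text2image/image-synthesis`, {
    method: "POST",
    headers: { ...headers, "X-DashScope-Async": "enable" },
    body: JSON.stringify({
      model: "wanx2.1-t2i-turbo", // illustrative; use the Wanxiang version you have access to
      input: { prompt: `Appetizing restaurant photo of ${dishDescription}, overhead shot` },
      parameters: { size: "1024*1024", n: 1 },
    }),
  });
  const { output } = await submit.json();

  // 2. Poll until the task settles. In production this loop runs inside a
  //    background job (see the Trigger.dev section below), not a request handler.
  while (true) {
    const poll = await fetch(`${DASHSCOPE}/tasks/${output.task_id}`, { headers });
    const { output: status } = await poll.json();
    if (status.task_status === "SUCCEEDED") return status.results[0].url;
    if (status.task_status === "FAILED") throw new Error("image generation failed");
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}
```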


๐Ÿ—๏ธ The Full-Stack Architecture

As a solo developer, I needed a stack that was fast to deploy but robust enough to handle heavy AI workloads.

  • Frontend: Next.js 14 + Tailwind CSS + Shadcn UI
  • Database & Auth: Supabase
  • Caching: Upstash (Redis)
  • Storage: Cloudflare R2
  • Background Jobs: Trigger.dev (crucial for handling long-running AI image generation)
  • Deployment: Cloudflare Workers

💡 Engineering Challenge: Handling Long-Running Tasks

Image generation and deep parsing can take 10–20 seconds, which is too long for a standard serverless function timeout.

Instead of setting up a separate heavy backend, I used Trigger.dev. It allowed me to:

  1. Offload the image generation to a background queue.
  2. Maintain a clean Next.js project structure.
  3. Provide real-time progress updates to the user via webhooks.
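
Here is a trimmed-down sketch of what that looks like with Trigger.dev's v3-style SDK. The task id, payload shape, and the `generateDishImage` helper (from the image-generation sketch above) are illustrative, and run metadata is shown as one simple way to surface progress:

```typescript
// Illustrative sketch: offload per-dish image generation to a background task.
import { task, metadata } from "@trigger.dev/sdk/v3";
import { generateDishImage } from "@/lib/wanxiang"; // hypothetical helper module

export const generateMenuImages = task({
  id: "generate-menu-images",
  run: async (payload: { menuId: string; dishes: string[] }) => {
    const urls: string[] = [];
    for (const [i, dish] of payload.dishes.entries()) {
      urls.push(await generateDishImage(dish));
      // Let the frontend show "3 of 12 dishes ready".
      metadata.set("progress", (i + 1) / payload.dishes.length);
    }
    return { menuId: payload.menuId, urls };
  },
});

// From a Next.js route handler or server action, this returns immediately
// while the job runs on the background queue:
//   await generateMenuImages.trigger({ menuId, dishes });
```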

🚀 From Idea to Launch

ReadMenuAI is my second major AI project. It implements the full lifecycle of a SaaS: authentication, credit-based billing (Stripe), and internationalization (12 supported languages).

Beyond the tech, this project is about cultural connection. By removing the language barrier in restaurants, we're making local culture more accessible and "delicious" for everyone.

Try it out here: readmenuai.com
