The Hidden Cost of Localization
If you’ve ever built a global product, you know the "German Word" problem. You design a beautiful, pixel-perfect UI with a crisp button that says "Sale" (4 letters). Then, you localize your app for Germany, and "Sale" becomes "Sommerschlussverkauf" (20 letters). Your UI shatters.
In the world of automated social media generation, this problem is magnified. True localization isn’t just about translating words; it’s about Language + Design Physics.
To solve this, I built a Design Studio. The core philosophy of this project is to completely decouple linguistic translation (powered by Lingo.dev) from visual assembly (handled by a dynamic React Canvas). Here is how the architecture comes together.
System Architecture: The Bird's-Eye View
The system operates in two distinct phases: generating the master English design using AI and Retrieval-Augmented Generation (RAG), and then fanning out that design globally using a localization engine.
PART 1: The Core Generation Engine
Real Marketing Needs Real Images: The RAG Approach
When building an automated content engine, the first instinct is to use image generators like Midjourney or DALL-E. However, this is fundamentally flawed for real businesses. A car dealership needs to sell their actual inventory; a resort needs to showcase their actual pool. AI hallucinations don't convert.
Instead, we use a RAG (Retrieval-Augmented Generation) approach. We prompt Gemini 2.5 Flash to act as an art director. It takes the user's goal, brainstorms three distinct "angles," and extracts a highly specific visual_search_term.
// From geminiService.js
const generateConcepts = async (userPrompt, platform, tone) => {
const prompt = `
USER PROMPT: "${userPrompt}"
TARGET PLATFORM: ${platform}
TARGET TONE: ${tone}
TASK: Brainstorm 3 distinct "content angles" or visual concepts for this post.
OUTPUT JSON ARRAY format:
[
{
"concept_title": "Short Title",
"visual_search_term": "A simple keyword phrase to find a photo"
}
]
`;
// ... Gemini API call ...
};
Finding the Perfect Image: Vector Search
Once we have our visual search terms (e.g., "Silver Dodge Challenger desert sunset"), we need to match them against the client's actual media library.
We do this by passing the term through the gemini-embedding-001 model to create a high-dimensional vector. We then use Cosine Similarity to compare this prompt vector against the pre-computed embeddings of our image pool stored in MongoDB.
// From sessionRoute.js
const findBestImage = async (searchQuery, excludePath = null) => {
const promptVector = await getEmbedding(searchQuery);
const allImages = await clientImagePool.find().select('filePath embedding description');
let scoredImages = allImages.map(img => ({
filePath: img.filePath,
score: similarity(promptVector, img.embedding)
}));
// Sort by highest cosine similarity
scoredImages.sort((a, b) => b.score - a.score);
return scoredImages.length > 0 ? scoredImages[0].filePath : null;
};
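The `similarity` helper used above is plain cosine similarity over the two embedding vectors. Here is a minimal sketch; the function name matches the route code, but this exact implementation is an illustration, not necessarily the repo's:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real-valued vectors
function similarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical directions score 1; orthogonal vectors score 0
console.log(similarity([1, 0], [1, 0])); // 1
console.log(similarity([1, 0], [0, 1])); // 0
```

Because cosine similarity measures angle rather than magnitude, a short query vector can still match a long, detailed image description, which is exactly what we want here.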
Designing the Post: The Multimodal AI Art Director
Now we have our conceptual angle and our perfectly matched image from the database. But dropping text on an image programmatically is notoriously difficult. How do we know where to put the text so it doesn't cover the main subject (like the face of a person or the grill of a car)? How do we pick a text color that has enough contrast against the background?
Instead of writing complex, brittle image-processing algorithms, we use Gemini 2.5 Flash as a multimodal art director. We send the model both the prompt and the actual image bytes, asking it to "look" at the photo and make design decisions.
Here is the exact code that powers this generation:
// From geminiService.js
const fs = require("fs");
// Helper function to convert the image file for Gemini's multimodal input
function fileToPart(path, mimeType) {
return {
inlineData: {
data: Buffer.from(fs.readFileSync(path)).toString("base64"),
mimeType
}
};
}
const generateSingleDesignPackage = async (imagePath, concept, platform, tone) => {
// Injecting platform-specific constraints
const platformRules = {
"Instagram": "Visual focus, engaging caption, 15-20 hashtags.",
"LinkedIn": "Professional insights, structured caption, 3-5 hashtags.",
"Twitter": "Short, punchy, <280 chars, 1-2 hashtags.",
"Facebook": "Conversational, community-focused. 3-5 hashtags."
};
const prompt = `
CONTEXT: Creating a social media post for ${platform}.
TONE: ${tone}
CONCEPT: ${concept.concept_title} ("${concept.visual_search_term}")
PLATFORM RULES: ${platformRules[platform] || "Standard social media post"}
TASK:
1. Analyze the image.
2. Write a caption matching the concept and tone.
3. Generate hashtags.
4. Design ONE text overlay (font, color, position) that fits this specific image.
OUTPUT JSON schema:
{
"caption": "string",
"hashtags": ["string"],
"suggested_text": "string (short overlay text)",
"font_family": "string (Google Font)",
"text_color_hex": "string (hex code)",
"font_size_score": number (1-10),
"bounding_box": { "ymin": number, "xmin": number, "ymax": number, "xmax": number },
"reasoning": "string"
}
`;
try {
const response = await ai.models.generateContent({
model: "gemini-2.5-flash",
contents: [
{ text: prompt },
fileToPart(imagePath, "image/png") // Passing the image alongside the text
],
// Enforcing strict JSON output
config: { responseMimeType: "application/json" }
});
let rawText = response.text || "{}";
rawText = rawText.replace(/```json|```/g, "").trim();
return JSON.parse(rawText);
} catch (error) {
console.error("Design Gen Error:", error);
return null;
}
};
There are three critical things happening in this function:
1. Dynamic Context Injection: We don't just ask for a "social media post." We inject platformRules so the AI knows Twitter requires short text and few hashtags, while Instagram demands a visual focus and heavy hashtag use.
2. Multimodal Vision: By passing fileToPart(imagePath, "image/png") in the contents array alongside our text prompt, Gemini actually "sees" the composition. It analyzes the negative space to decide where the bounding_box should go, and looks at the image's palette to pick a contrasting text_color_hex.
3. Strict JSON Enforcement: By setting responseMimeType: "application/json" and defining a hardcoded schema in the prompt, we force the AI to return structured data instead of conversational text.
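Even with a JSON response type enforced, it's worth validating the parsed object before the frontend trusts it. A hypothetical guard (the field names mirror the schema in the prompt above; the helper itself is ours, not the repo's):

```javascript
// Hypothetical guard: reject design packages missing required fields
// or carrying an out-of-range bounding box.
function isValidDesignPackage(pkg) {
  if (!pkg || typeof pkg !== "object") return false;
  const required = ["caption", "suggested_text", "font_family", "text_color_hex", "bounding_box"];
  if (!required.every((key) => key in pkg)) return false;
  const box = pkg.bounding_box;
  // Accept either 0-1 fractions or 0-100 percentages; both stay within [0, 100]
  return ["ymin", "xmin", "ymax", "xmax"].every(
    (k) => typeof box[k] === "number" && box[k] >= 0 && box[k] <= 100
  );
}
```

A guard like this turns a malformed model response into a retry instead of a broken canvas.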
From JSON to React Canvas
To make this work, the frontend needs to be a deterministic rendering engine. We use a combination of React hooks and the react-rnd (Resizable and Draggable) library.
When the JSON arrives, the VariationCanvas.jsx component first dynamically injects the requested Google Font into the document head. Then, it waits for the background image to fully load so it can calculate the exact pixel dimensions of the user's specific screen or container.
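The font-injection step can be done by appending a <link> tag pointed at the Google Fonts CSS endpoint. A sketch, assuming the helper names (buildGoogleFontHref, injectGoogleFont) are ours rather than the repo's:

```javascript
// Build a Google Fonts stylesheet URL for a family like "Noto Sans JP"
function buildGoogleFontHref(family) {
  return `https://fonts.googleapis.com/css2?family=${family.trim().replace(/ /g, "+")}&display=swap`;
}

// Append the <link> once per family (guarded so it is a no-op outside a browser)
function injectGoogleFont(family) {
  if (typeof document === "undefined") return;
  const href = buildGoogleFontHref(family);
  if (document.querySelector(`link[href="${href}"]`)) return; // already injected
  const link = document.createElement("link");
  link.rel = "stylesheet";
  link.href = href;
  document.head.appendChild(link);
}
```

The duplicate check matters because the canvas re-renders on every variation switch, and re-injecting the same stylesheet would pile up <link> tags.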
Here is how we translate the AI's relative percentages into interactive, absolute pixels:
// From VariationCanvas.jsx
import React, { useRef, useState, useEffect } from 'react';
import { Rnd } from 'react-rnd';
function VariationCanvas({ imageUrl, data }) {
// 1. Initialize State with AI JSON Data
const [text, setText] = useState(data.suggested_text);
const [color, setColor] = useState(data.text_color_hex);
const [fontFamily, setFontFamily] = useState(getCleanFontName(data.font_family)); // helper strips quotes/fallbacks from the AI's font name
const [fontSize, setFontSize] = useState(data.font_size_score * 6); // map the AI's 1-10 score to pixels
const [isImageLoaded, setIsImageLoaded] = useState(false); // gates layout math until the image has real dimensions
// Default box state before image loads
const [boxState, setBoxState] = useState({ x: 0, y: 0, width: 200, height: 100 });
const containerRef = useRef(null);
// 2. Map Relative AI Coordinates to Absolute Pixels
useEffect(() => {
if (containerRef.current && isImageLoaded) {
const { offsetWidth, offsetHeight } = containerRef.current;
// Extract AI layout rules
let { ymin, xmin, ymax, xmax } = data.bounding_box;
// Helper to ensure values are treated as percentages
const normalize = (val) => (val <= 1 && val > 0) ? val * 100 : val;
const xPct = normalize(xmin);
const yPct = normalize(ymin);
const wPct = normalize(xmax - xmin);
const hPct = normalize(ymax - ymin);
// Convert percentages to exact pixel values based on the dynamic container
setBoxState({
x: (xPct / 100) * offsetWidth,
y: (yPct / 100) * offsetHeight,
width: (wPct / 100) * offsetWidth,
height: (hPct / 100) * offsetHeight
});
}
}, [data, isImageLoaded]);
return (
<div ref={containerRef} className="relative flex-1 bg-black overflow-hidden group">
<img src={imageUrl} onLoad={() => setIsImageLoaded(true)} className="w-full h-full object-contain" />
{isImageLoaded && (
// 3. Render the Draggable/Resizable Component
<Rnd
size={{ width: boxState.width, height: boxState.height }}
position={{ x: boxState.x, y: boxState.y }}
bounds="parent"
onDragStop={(e, d) => setBoxState(prev => ({ ...prev, x: d.x, y: d.y }))}
onResizeStop={(e, dir, ref, delta, pos) => setBoxState({
width: parseInt(ref.style.width),
height: parseInt(ref.style.height),
...pos
})}
>
{/* 4. Apply AI Styling to the Editable Text */}
<textarea
value={text}
onChange={(e) => setText(e.target.value)}
style={{
color: color,
fontFamily: `"${fontFamily}", sans-serif`,
fontSize: `${fontSize}px`,
fontWeight: 'bold'
}}
className="w-full h-full bg-transparent resize-none outline-none"
/>
</Rnd>
)}
</div>
);
}
Why This Architecture Matters
True Responsiveness: If the user opens this app on a mobile phone vs. a 4K desktop monitor, the text doesn't break. The useEffect recalculates the absolute pixels based on the offsetWidth and offsetHeight of the wrapper div, ensuring the text is always exactly where the AI intended it to be.
User Agency: Because the text is rendered in an HTML <textarea> inside an <Rnd> wrapper, the user can click in, fix a typo, drag the box a few pixels to the left, or grab the corner to resize it. The AI does 95% of the design work, but the human retains 100% of the control.
Frictionless Editing: Debounced Auto-Save & Atomic Updates
Even with the best AI layout rules, users will inevitably want to tweak the design: dragging a text box a few pixels to the left, or changing a brand color. To make this feel like a native desktop app, the studio features a frictionless, auto-saving canvas.
The Frontend: Debouncing the Canvas
Every time a user resizes the text or drags the bounding box, we don't want to bombard the database with hundreds of API calls. Instead, VariationCanvas.jsx uses a custom React useEffect hook acting as a 1000ms debounce timer.
Crucially, before sending the data to the backend, the frontend translates the absolute pixel values on the screen back into relative percentages (xmin, ymin, xmax, ymax) based on the current container dimensions. This ensures the design remains perfectly responsive across different devices.
// From VariationCanvas.jsx
useEffect(() => {
// Only trigger if changes occurred and the image is fully loaded
if (!hasChanged || !isImageLoaded || !sessionId) return;
const timerId = setTimeout(async () => {
const { offsetWidth, offsetHeight } = containerRef.current;
// Convert absolute screen pixels back to relative percentages
const xmin = (boxState.x / offsetWidth) * 100;
const ymin = (boxState.y / offsetHeight) * 100;
const xmax = ((boxState.x + boxState.width) / offsetWidth) * 100;
const ymax = ((boxState.y + boxState.height) / offsetHeight) * 100;
const payload = {
sessionId,
optionId: parentOptionId,
langCode: currentLangCode,
boundingBox: { xmin, ymin, xmax, ymax },
color: color,
fontSize: fontSize
};
await axios.patch(`${apiBase}/sessions/update-layout`, payload);
setHasChanged(false); // Reset tracking state
}, 1000);
return () => clearTimeout(timerId); // Clear timer on rapid user interactions
}, [boxState, color, fontSize, hasChanged /* ... dependencies */]);
The Backend: Atomic MongoDB Updates
When that payload hits the /update-layout route, we face a database challenge: our session history is a nested array (History -> Variations).
Fetching the entire document, updating one coordinate, and saving it back creates race conditions. Instead, we use MongoDB's $set operator combined with arrayFilters. This allows us to perform a surgical, atomic update deep within the document without touching the rest of the data structure.
// From sessionRoute.js
router.patch('/update-layout', async (req, res) => {
const { sessionId, optionId, langCode, boundingBox, color, fontSize } = req.body;
const updateFields = {};
// Target specific array elements using aliases [h] and [v]
const arrayFilters = [
{ "h.variations.option_id": optionId },
{ "v.option_id": optionId }
];
if (boundingBox) updateFields["history.$[h].variations.$[v].bounding_box"] = boundingBox;
if (color) updateFields["history.$[h].variations.$[v].text_color_hex"] = color;
if (fontSize) updateFields["history.$[h].variations.$[v].font_size_score"] = fontSize;
// Execute the surgical update
await SessionData.updateOne(
{ sessionId: sessionId },
{ $set: updateFields },
{ arrayFilters: arrayFilters }
);
res.json({ success: true });
});
PART 2: The Localization Engine
The Pivot: Scaling to the World
At this point, we have a perfectly generated, editable English design. But what happens when the user clicks "Globalize"? This is where the physics-aware architecture proves its worth.
Batch Translation with Lingo.dev
To handle the heavy lifting of multi-language translation, we integrate the official @lingo.dev/sdk. Lingo.dev is incredibly fast and built specifically for developer workflows.
We can pass our entire content payload (the overlay text, the social caption, and the hashtags) into a single, parallelized batch process.
// From lingoService.js
const translateFullPackage = async (designObject, targetLangs = ['es', 'de', 'fr', 'ja', 'ar', 'hi']) => {
const contentToTranslate = {
text: designObject.text,
caption: designObject.caption,
hashtags: designObject.hashtags.join(', ')
};
const translationPromises = targetLangs.map(async (code) => {
// Utilize fast: true for snappy UI response
const translatedObj = await lingoDotDev.localizeObject(contentToTranslate, {
sourceLocale: "en",
targetLocale: code,
fast: true
});
// ... return mapped object ...
});
return await Promise.all(translationPromises);
};
Engineering the Auto-Fit Layout Engine (The "Secret Sauce")
When the localized text returns, it is often significantly longer or shorter than the English original. If we just dropped it into the original bounding box, it would clip or overflow.
Our layout engine intercepts this in VariationCanvas.jsx. By utilizing the HTML5 Canvas context, we can measure the exact pixel width of the newly translated words before they render. If a word like "Sommerschlussverkauf" is too wide for the box, the algorithm dynamically expands the bounding box to ensure the text fits perfectly without breaking the visual hierarchy.
// From VariationCanvas.jsx
useEffect(() => {
// ... setup: read containerWidth from the container ref, element from the textarea ref ...
// Use an off-screen canvas context to measure text without rendering it
const context = document.createElement("canvas").getContext("2d");
context.font = `bold ${fontSize}px "${fontFamily}", sans-serif`;
const words = text.split(/\s+/);
let maxWordWidth = 0;
// Measure the physical width of the translated words
words.forEach(word => {
const metrics = context.measureText(word);
if (metrics.width > maxWordWidth) maxWordWidth = metrics.width;
});
const requiredWidth = Math.min(Math.ceil(maxWordWidth + 20), containerWidth);
// Dynamically expand the box width if the translation demands it
if (requiredWidth > boxState.width) {
setBoxState(prev => ({ ...prev, width: requiredWidth }));
return;
}
// Expand height if text wraps and exceeds current box
if (element.scrollHeight > element.clientHeight) {
setBoxState(prev => ({ ...prev, height: element.scrollHeight + 10 }));
}
}, [text, fontSize, fontFamily]);
Handling Global Typography & RTL Physics
A standard "Inter" or "Arial" font will render Japanese characters as empty boxes or severely mangle Arabic script.
To solve this, the backend maps specific language codes to culturally appropriate Google Fonts (like Noto Sans JP or Noto Kufi Arabic). On the frontend, we inject React inline styles that instantly flip the canvas physics for Right-to-Left languages.
// From VariationCanvas.jsx
<textarea
style={{
direction: isRTL ? 'rtl' : 'ltr',
textAlign: isRTL ? 'right' : 'left',
fontFamily: `"${fontFamily}", sans-serif`
}}
/>
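The backend's language-to-typography mapping can be as simple as a lookup table keyed by locale code. A sketch (an illustrative table in the spirit of the text above, not the repo's exact list):

```javascript
// Map language codes to script-appropriate Google Fonts plus text direction.
const LANG_TYPOGRAPHY = {
  ja: { font: "Noto Sans JP", rtl: false },
  ar: { font: "Noto Kufi Arabic", rtl: true },
  hi: { font: "Noto Sans Devanagari", rtl: false },
  es: { font: "Inter", rtl: false },
  de: { font: "Inter", rtl: false },
  fr: { font: "Inter", rtl: false }
};

// Fall back to the master design's Latin font for unmapped locales
function typographyFor(langCode) {
  return LANG_TYPOGRAPHY[langCode] || { font: "Inter", rtl: false };
}
```

The frontend can then derive both the isRTL flag and the fontFamily for a given translation from a single lookup, keeping the direction and typeface decisions in one place.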
Database Strategy: Immutable Nested State
To keep the UI snappy and the database clean, clicking "Globalize" doesn't create six entirely new post documents. Instead, translations are saved as an immutable, nested array inside the parent English option.
If a user nudges the layout of the Spanish translation, we use a targeted, atomic MongoDB update ($set with arrayFilters) to update only that specific language's coordinates without corrupting the master English design.
// From sessionRoute.js
const result = await SessionData.updateOne(
{ sessionId: sessionId },
{ $set: { "history.$[h].variations.$[v].translations.$[t].bounding_box": boundingBox } },
{ arrayFilters: [
{ "h.variations.option_id": optionId },
{ "v.option_id": optionId },
{ "t.langCode": langCode }
]}
);
The Final Result & The Builder Lesson
(TODO: Insert GIF/Video of the app)
Inclusive, global products require thinking far beyond just replacing words in a JSON file. By leaning on powerful APIs like Lingo.dev and Gemini to handle the heavy linguistic and creative lifting, developers can focus their energy on building magical, physics-aware user interfaces that feel native to every user, everywhere.
Resources & Links
GitHub Repo: thirupathi8
Demo Video: YouTube
Lingo.dev SDK Docs: lingo.dev
Google Gemini API: ai.google.dev