DEV Community

ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Step-by-Step Guide: Build a Multi-Modal AI App with GPT-4o 2026 and React 19 – Integrate Image and Text Generation

Multi-modal AI applications that process both text and image inputs are transforming how users interact with technology. With the 2026 release of GPT-4o—OpenAI’s most advanced multi-modal model to date—and React 19’s streamlined component architecture, building these apps is more accessible than ever. This guide walks you through creating a full-featured multi-modal AI app that integrates text generation, image understanding, and image generation.

Prerequisites

  • Node.js 20+ installed locally
  • Valid OpenAI API key with GPT-4o 2026 access
  • Basic familiarity with React and JavaScript ES6+
  • React 19 compatible package manager (npm 10+ or yarn 1.22+)

Step 1: Initialize Your React 19 Project

We’ll use Vite to scaffold our React 19 project, as it offers faster build times and native React 19 support:

npm create vite@latest multi-modal-ai-app -- --template react
cd multi-modal-ai-app
npm install

Verify React 19 is installed by checking your package.json dependencies—you should see "react": "^19.0.0" and "react-dom": "^19.0.0".

Step 2: Install Required Dependencies

Install the OpenAI SDK for API calls and react-dropzone for handling image uploads:

npm install openai react-dropzone

Step 3: Configure OpenAI API Access

Create a .env file in your project root to store your OpenAI API key. Vite requires environment variables to be prefixed with VITE_ to expose them to the client:

VITE_OPENAI_API_KEY=your_openai_api_key_here
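Also add the file to .gitignore so the key is never committed:

```
# .gitignore
.env
```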

Next, create a src/lib/openai.js file to initialize the OpenAI client:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: import.meta.env.VITE_OPENAI_API_KEY,
  dangerouslyAllowBrowser: true, // Note: For production, proxy API calls through a backend
});

export default openai;

Note: The dangerouslyAllowBrowser flag is used here for simplicity. In production, always route OpenAI API calls through a secure backend to protect your API key.
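As a sketch of that production setup, here is a minimal Node.js proxy that keeps the key server-side. The endpoint path (/api/chat), port, and helper name are illustrative choices, and it assumes Node 20+ (which ships a global fetch):

```javascript
// Minimal server-side proxy sketch: the browser posts to /api/chat,
// and the OpenAI key is read only here, never shipped to the client.
import http from 'node:http';

// Pure helper: shape the client's request body into the payload
// forwarded to OpenAI (model name as used throughout this guide).
function buildChatPayload(clientBody) {
  return {
    model: 'gpt-4o-2026',
    messages:
      clientBody.messages ?? [{ role: 'user', content: clientBody.prompt ?? '' }],
  };
}

const server = http.createServer(async (req, res) => {
  if (req.method !== 'POST' || req.url !== '/api/chat') {
    res.writeHead(404).end();
    return;
  }
  // Collect the raw request body.
  let raw = '';
  for await (const chunk of req) raw += chunk;

  // Forward to OpenAI with the server-side key attached.
  const upstream = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // server-side only
    },
    body: JSON.stringify(buildChatPayload(JSON.parse(raw))),
  });
  res.writeHead(upstream.status, { 'Content-Type': 'application/json' });
  res.end(await upstream.text());
});

server.listen(3001, () => console.log('Proxy listening on http://localhost:3001'));
```

The frontend then calls `fetch('/api/chat', { method: 'POST', body: JSON.stringify({ prompt }) })` instead of using the OpenAI SDK directly with dangerouslyAllowBrowser.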

Step 4: Build Core UI Components

We’ll structure the app with three core sections: text generation, image upload/multi-modal processing, and image generation. Update your src/App.jsx to include the following base structure:

import { useState } from 'react';
import openai from './lib/openai';
import { useDropzone } from 'react-dropzone';

function App() {
  const [textPrompt, setTextPrompt] = useState('');
  const [textResponse, setTextResponse] = useState('');
  const [imagePrompt, setImagePrompt] = useState('');
  const [uploadedImage, setUploadedImage] = useState(null);
  const [imageBase64, setImageBase64] = useState('');
  const [multiModalResponse, setMultiModalResponse] = useState('');
  const [generatedImage, setGeneratedImage] = useState('');
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  // Dropzone config for image upload
  const { getRootProps, getInputProps } = useDropzone({
    accept: { 'image/*': [] },
    onDrop: (acceptedFiles) => {
      const file = acceptedFiles[0];
      setUploadedImage(URL.createObjectURL(file));
      const reader = new FileReader();
      reader.onload = () => {
        const base64 = reader.result.split(',')[1];
        setImageBase64(base64);
      };
      reader.readAsDataURL(file);
    },
  });

  // Text generation handler
  const handleTextGeneration = async () => {
    setLoading(true);
    setError('');
    try {
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-2026',
        messages: [{ role: 'user', content: textPrompt }],
      });
      setTextResponse(completion.choices[0].message.content);
    } catch (err) {
      setError(`Text generation failed: ${err.message}`);
    } finally {
      setLoading(false);
    }
  };

  // Multi-modal (image + text) processing handler
  const handleMultiModalProcessing = async () => {
    if (!imageBase64) {
      setError('Please upload an image first.');
      return;
    }
    setLoading(true);
    setError('');
    try {
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-2026',
        messages: [
          {
            role: 'user',
            content: [
              { type: 'text', text: imagePrompt || 'Describe this image in detail.' },
              {
                type: 'image_url',
                image_url: { url: `data:image/jpeg;base64,${imageBase64}` }, // note: assumes a JPEG upload; match the MIME type to the actual file
              },
            ],
          },
        ],
      });
      setMultiModalResponse(completion.choices[0].message.content);
    } catch (err) {
      setError(`Multi-modal processing failed: ${err.message}`);
    } finally {
      setLoading(false);
    }
  };

  // Image generation handler
  const handleImageGeneration = async () => {
    if (!imagePrompt) {
      setError('Please enter an image generation prompt.');
      return;
    }
    setLoading(true);
    setError('');
    try {
      const response = await openai.images.generate({
        model: 'gpt-4o-2026-image',
        prompt: imagePrompt,
        n: 1,
        size: '1024x1024',
      });
      setGeneratedImage(response.data[0].url);
    } catch (err) {
      setError(`Image generation failed: ${err.message}`);
    } finally {
      setLoading(false);
    }
  };

  return (
    <div className="app-container">
      <h1>Multi-Modal AI App</h1>
      {error && <div className="error">{error}</div>}
      {loading && <div className="loading">Processing...</div>}

      {/* Text Generation Section */}
      <section className="section">
        <h2>Text Generation</h2>
        <textarea
          value={textPrompt}
          onChange={(e) => setTextPrompt(e.target.value)}
          placeholder="Enter your text prompt here..."
          rows={4}
        />
        <button onClick={handleTextGeneration} disabled={loading}>
          Generate Text
        </button>
        {textResponse && (
          <div className="response">
            <h3>Response:</h3>
            <p>{textResponse}</p>
          </div>
        )}
      </section>

      {/* Multi-Modal Processing Section */}
      <section className="section">
        <h2>Image + Text Processing</h2>
        <div {...getRootProps()} className="dropzone">
          <input {...getInputProps()} />
          <p>Drag and drop an image here, or click to select</p>
        </div>
        {uploadedImage && (
          <img src={uploadedImage} alt="Uploaded preview" className="preview" />
        )}
        <textarea
          value={imagePrompt}
          onChange={(e) => setImagePrompt(e.target.value)}
          placeholder="Enter a prompt for the image (e.g., 'Describe this image')"
          rows={4}
        />
        <button onClick={handleMultiModalProcessing} disabled={loading}>
          Process Image + Text
        </button>
        {multiModalResponse && (
          <div className="response">
            <h3>Response:</h3>
            <p>{multiModalResponse}</p>
          </div>
        )}
      </section>

      {/* Image Generation Section */}
      <section className="section">
        <h2>Image Generation</h2>
        <textarea
          value={imagePrompt}
          onChange={(e) => setImagePrompt(e.target.value)}
          placeholder="Enter your image generation prompt..."
          rows={4}
        />
        <button onClick={handleImageGeneration} disabled={loading}>
          Generate Image
        </button>
        {generatedImage && (
          <div className="response">
            <h3>Generated Image:</h3>
            <img src={generatedImage} alt="Generated" className="preview" />
          </div>
        )}
      </section>
    </div>
  );
}

export default App;

Step 5: Add Basic Styling

Add minimal CSS to make the app usable. Update src/index.css with the following:

* {
  box-sizing: border-box;
  margin: 0;
  padding: 0;
}

body {
  font-family: Arial, sans-serif;
  line-height: 1.6;
  padding: 20px;
  background: #f5f5f5;
}

.app-container {
  max-width: 1200px;
  margin: 0 auto;
}

.section {
  background: white;
  padding: 20px;
  margin-bottom: 20px;
  border-radius: 8px;
  box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}

h1, h2, h3 {
  margin-bottom: 10px;
}

textarea {
  width: 100%;
  padding: 10px;
  margin-bottom: 10px;
  border: 1px solid #ccc;
  border-radius: 4px;
}

button {
  padding: 10px 20px;
  background: #007bff;
  color: white;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  margin-bottom: 10px;
}

button:disabled {
  background: #ccc;
  cursor: not-allowed;
}

.dropzone {
  border: 2px dashed #ccc;
  padding: 40px;
  text-align: center;
  margin-bottom: 10px;
  cursor: pointer;
  border-radius: 4px;
}

.dropzone:hover {
  border-color: #007bff;
}

.preview {
  max-width: 300px;
  margin: 10px 0;
  border-radius: 4px;
}

.response {
  margin-top: 10px;
  padding: 10px;
  background: #f9f9f9;
  border-radius: 4px;
}

.error {
  padding: 10px;
  background: #ffcccc;
  color: #cc0000;
  border-radius: 4px;
  margin-bottom: 10px;
}

.loading {
  padding: 10px;
  background: #ffffcc;
  border-radius: 4px;
  margin-bottom: 10px;
}

Step 6: Test Your Application

Start the development server:

npm run dev

Open the provided local URL in your browser and test all three features:

  • Enter a text prompt (e.g., "Explain quantum computing in simple terms") and click Generate Text
  • Upload an image, add an optional prompt, and click Process Image + Text
  • Enter an image generation prompt (e.g., "A futuristic city with flying cars at sunset") and click Generate Image

Production Considerations

Before deploying your app, make sure to:

  • Remove the dangerouslyAllowBrowser flag and proxy OpenAI API calls through a secure backend (e.g., Node.js/Express, Cloudflare Workers) to protect your API key
  • Add rate limiting to prevent API abuse
  • Implement proper error handling and user feedback
  • Optimize image uploads to reduce bandwidth usage
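For the rate-limiting point, here is a minimal sketch: an in-memory sliding window per client. The limits are illustrative, and in-memory state only works for a single server process; a real deployment would use a shared store such as Redis.

```javascript
// Per-client sliding-window rate limiter (single-process sketch).
const WINDOW_MS = 60_000; // window length: 1 minute (illustrative)
const MAX_REQUESTS = 10;  // allowed requests per window per client (illustrative)

const hits = new Map(); // clientId -> timestamps of recent requests

function allowRequest(clientId, now = Date.now()) {
  // Keep only timestamps that are still inside the window.
  const recent = (hits.get(clientId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(clientId, recent);
    return false; // caller should respond with HTTP 429 Too Many Requests
  }
  recent.push(now);
  hits.set(clientId, recent);
  return true;
}
```

A backend proxy would call allowRequest (keyed by IP or user ID) before forwarding anything to the OpenAI API.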

Conclusion

You've now built a fully functional multi-modal AI app using GPT-4o 2026 and React 19. The app demonstrates the core multi-modal capabilities: text generation, image understanding, and image generation. You can extend it further with features like chat history, multiple image uploads, or voice input.