Step-by-Step Guide: Build a Multi-Modal AI App with GPT-4o 2026 and React 19
Multi-modal AI applications that process both text and image inputs are transforming how users interact with technology. With the 2026 release of GPT-4o—OpenAI’s most advanced multi-modal model to date—and React 19’s streamlined component architecture, building these apps is more accessible than ever. This guide walks you through creating a full-featured multi-modal AI app that integrates text generation, image understanding, and image generation.
Prerequisites
- Node.js 20+ installed locally
- Valid OpenAI API key with GPT-4o 2026 access
- Basic familiarity with React and JavaScript ES6+
- React 19 compatible package manager (npm 10+ or yarn 1.22+)
Step 1: Initialize Your React 19 Project
We’ll use Vite to scaffold our React 19 project, as it offers faster build times and native React 19 support:
npm create vite@latest multi-modal-ai-app -- --template react
cd multi-modal-ai-app
npm install
Verify React 19 is installed by checking your package.json dependencies—you should see "react": "^19.0.0" and "react-dom": "^19.0.0".
Step 2: Install Required Dependencies
Install the OpenAI SDK for API calls and react-dropzone for handling image uploads:
npm install openai react-dropzone
Step 3: Configure OpenAI API Access
Create a .env file in your project root to store your OpenAI API key. Vite requires environment variables to be prefixed with VITE_ to expose them to the client:
VITE_OPENAI_API_KEY=your_openai_api_key_here
Next, create a src/lib/openai.js file to initialize the OpenAI client:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: import.meta.env.VITE_OPENAI_API_KEY,
dangerouslyAllowBrowser: true, // Note: For production, proxy API calls through a backend
});
export default openai;
Note: The dangerouslyAllowBrowser flag is used here for simplicity. In production, always route OpenAI API calls through a secure backend to protect your API key.
Step 4: Build Core UI Components
We’ll structure the app with three core sections: text generation, image upload/multi-modal processing, and image generation. Update your src/App.jsx to include the following base structure:
import { useState } from 'react';
import openai from './lib/openai';
import { useDropzone } from 'react-dropzone';
function App() {
const [textPrompt, setTextPrompt] = useState('');
const [textResponse, setTextResponse] = useState('');
const [imagePrompt, setImagePrompt] = useState('');
const [uploadedImage, setUploadedImage] = useState(null);
const [imageBase64, setImageBase64] = useState('');
const [multiModalResponse, setMultiModalResponse] = useState('');
const [generatedImage, setGeneratedImage] = useState('');
const [loading, setLoading] = useState(false);
const [error, setError] = useState('');
// Dropzone config for image upload
const { getRootProps, getInputProps } = useDropzone({
accept: { 'image/*': [] },
onDrop: (acceptedFiles) => {
const file = acceptedFiles[0];
setUploadedImage(URL.createObjectURL(file));
const reader = new FileReader();
reader.onload = () => {
const base64 = reader.result.split(',')[1];
setImageBase64(base64);
};
reader.readAsDataURL(file);
},
});
// Text generation handler
const handleTextGeneration = async () => {
setLoading(true);
setError('');
try {
const completion = await openai.chat.completions.create({
model: 'gpt-4o-2026',
messages: [{ role: 'user', content: textPrompt }],
});
setTextResponse(completion.choices[0].message.content);
} catch (err) {
setError(`Text generation failed: ${err.message}`);
} finally {
setLoading(false);
}
};
// Multi-modal (image + text) processing handler
const handleMultiModalProcessing = async () => {
if (!imageBase64) {
setError('Please upload an image first.');
return;
}
setLoading(true);
setError('');
try {
const completion = await openai.chat.completions.create({
model: 'gpt-4o-2026',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: imagePrompt || 'Describe this image in detail.' },
{
type: 'image_url',
image_url: { url: `data:image/jpeg;base64,${imageBase64}` }, // assumes JPEG; derive the MIME type from the uploaded file in production
},
],
},
],
});
setMultiModalResponse(completion.choices[0].message.content);
} catch (err) {
setError(`Multi-modal processing failed: ${err.message}`);
} finally {
setLoading(false);
}
};
// Image generation handler
const handleImageGeneration = async () => {
if (!imagePrompt) {
setError('Please enter an image generation prompt.');
return;
}
setLoading(true);
setError('');
try {
const response = await openai.images.generate({
model: 'gpt-4o-2026-image',
prompt: imagePrompt,
n: 1,
size: '1024x1024',
});
setGeneratedImage(response.data[0].url);
} catch (err) {
setError(`Image generation failed: ${err.message}`);
} finally {
setLoading(false);
}
};
return (
<div className="app-container">
<h1>Multi-Modal AI App</h1>
{error && <div className="error">{error}</div>}
{loading && <div className="loading">Processing...</div>}
{/* Text Generation Section */}
<section className="section">
<h2>Text Generation</h2>
<textarea
value={textPrompt}
onChange={(e) => setTextPrompt(e.target.value)}
placeholder="Enter your text prompt here..."
rows={4}
/>
<button onClick={handleTextGeneration} disabled={loading}>
Generate Text
</button>
{textResponse && (
<div className="response">
<h3>Response:</h3>
<p>{textResponse}</p>
</div>
)}
</section>
{/* Multi-Modal Processing Section */}
<section className="section">
<h2>Image + Text Processing</h2>
<div {...getRootProps()} className="dropzone">
<input {...getInputProps()} />
<p>Drag and drop an image here, or click to select</p>
</div>
{uploadedImage && (
<img src={uploadedImage} alt="Uploaded preview" className="preview" />
)}
<textarea
value={imagePrompt}
onChange={(e) => setImagePrompt(e.target.value)}
placeholder="Enter a prompt for the image (e.g., 'Describe this image')"
rows={4}
/>
<button onClick={handleMultiModalProcessing} disabled={loading}>
Process Image + Text
</button>
{multiModalResponse && (
<div className="response">
<h3>Response:</h3>
<p>{multiModalResponse}</p>
</div>
)}
</section>
{/* Image Generation Section */}
<section className="section">
<h2>Image Generation</h2>
<textarea
value={imagePrompt}
onChange={(e) => setImagePrompt(e.target.value)}
placeholder="Enter your image generation prompt..."
rows={4}
/>
<button onClick={handleImageGeneration} disabled={loading}>
Generate Image
</button>
{generatedImage && (
<div className="response">
<h3>Generated Image:</h3>
<img src={generatedImage} alt="Generated" className="preview" />
</div>
)}
</section>
</div>
);
}
export default App;
Step 5: Add Basic Styling
Add minimal CSS to make the app usable. Update src/index.css with the following:
* {
box-sizing: border-box;
margin: 0;
padding: 0;
}
body {
font-family: Arial, sans-serif;
line-height: 1.6;
padding: 20px;
background: #f5f5f5;
}
.app-container {
max-width: 1200px;
margin: 0 auto;
}
.section {
background: white;
padding: 20px;
margin-bottom: 20px;
border-radius: 8px;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
h1, h2, h3 {
margin-bottom: 10px;
}
textarea {
width: 100%;
padding: 10px;
margin-bottom: 10px;
border: 1px solid #ccc;
border-radius: 4px;
}
button {
padding: 10px 20px;
background: #007bff;
color: white;
border: none;
border-radius: 4px;
cursor: pointer;
margin-bottom: 10px;
}
button:disabled {
background: #ccc;
cursor: not-allowed;
}
.dropzone {
border: 2px dashed #ccc;
padding: 40px;
text-align: center;
margin-bottom: 10px;
cursor: pointer;
border-radius: 4px;
}
.dropzone:hover {
border-color: #007bff;
}
.preview {
max-width: 300px;
margin: 10px 0;
border-radius: 4px;
}
.response {
margin-top: 10px;
padding: 10px;
background: #f9f9f9;
border-radius: 4px;
}
.error {
padding: 10px;
background: #ffcccc;
color: #cc0000;
border-radius: 4px;
margin-bottom: 10px;
}
.loading {
padding: 10px;
background: #ffffcc;
border-radius: 4px;
margin-bottom: 10px;
}
Step 6: Test Your Application
Start the development server:
npm run dev
Open the provided local URL in your browser and test all three features:
- Enter a text prompt (e.g., "Explain quantum computing in simple terms") and click Generate Text
- Upload an image, add an optional prompt, and click Process Image + Text
- Enter an image generation prompt (e.g., "A futuristic city with flying cars at sunset") and click Generate Image
Production Considerations
Before deploying your app, make sure to:
- Remove the dangerouslyAllowBrowser flag and proxy OpenAI API calls through a secure backend (e.g., Node.js/Express, Cloudflare Workers) to protect your API key
- Add rate limiting to prevent API abuse
- Implement proper error handling and user feedback
- Optimize image uploads to reduce bandwidth usage
Conclusion
You’ve now built a fully functional multi-modal AI app using GPT-4o 2026 and React 19. This app demonstrates core multi-modal capabilities: text generation, image understanding, and image generation. You can extend this further by adding features like chat history, multiple image uploads, or voice input integration.
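As one sketch of the chat-history extension, the single-turn text handler can be adapted to accumulate prior turns in the messages array so the model sees earlier context. The buildNextMessages helper and the history state name are illustrative, not part of the guide's code.

```javascript
// Sketch: multi-turn chat by sending the full conversation on each request.
// Pure helper that appends the new user turn to the existing history.
function buildNextMessages(history, userPrompt) {
  return [...history, { role: 'user', content: userPrompt }];
}

// In the component, the history would live in state (illustrative):
// const [history, setHistory] = useState([]);
// const messages = buildNextMessages(history, textPrompt);
// const completion = await openai.chat.completions.create({ model: 'gpt-4o-2026', messages });
// setHistory([...messages, completion.choices[0].message]);
```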