DEV Community: Arunav Maitra

Meet Persona-Portraits AI

Arunav Maitra — Mon, 15 Sep 2025 06:46:00 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built Persona-Portraits AI.

It’s a magical web experience that lets you become the hero of any story.

Ever wondered what you'd look like as an astronaut gazing at Earth? Or a cyberpunk rebel in a neon-drenched city? Now, you don't have to wonder.

Persona-Portraits AI solves a fun, creative challenge:
How can we reimagine ourselves in fantastical scenarios without complex editing software?

My applet provides the answer.

You simply upload your photo, pick a scene, and our AI assistant gets to work. It intelligently blends your face onto a new body, with new clothes, a new background, and a new attitude—all while keeping you recognizable.

It’s your personal digital costume designer and movie-set creator, all rolled into one.

Demo: visit here

Here’s a glimpse into the magic of Persona-Portraits AI.

Imagine a sleek, animated interface with a swirling galaxy in the background.

First, you're greeted by the bold, italic headline: Step Into Another World.

You click the glowing upload area, select your best selfie, and watch as interactive scenario cards slide into view. Each card—from 'Executive Drive' to 'Cosmic Explorer'—shimmers with possibility.

You tap on Enchanted Forest. The card glows with a vibrant purple border, confirming your choice.

With a deep breath, you press the big, beautiful, pulsating button:
"Transform My Photo"

Instantly, a mesmerizing loader appears, cycling through witty messages:

"Warming up the digital canvas..."
"Consulting with the art muses..."
"Almost there, adding the final touches..."

And then, it happens.

A breathtaking image fades in. It's you, but reimagined. You're an elf, with ethereal robes, standing in a forest lit by glowing mushrooms. The likeness is uncanny.

A stylish "Download Image" button appears, and with one click, your new persona is saved.

This is the seamless, powerful, and utterly fun experience of Persona-Portraits AI.

How I Used Google AI Studio

Google AI Studio was the creative heart of this project.

The entire application is powered by the phenomenal capabilities of the Gemini 2.5 Flash Image Preview model (model ID: gemini-2.5-flash-image-preview).

This model is a wizard at understanding and editing images based on text commands.

My process involved:

Prototyping Prompts: I used Google AI Studio as a sandbox. I experimented with dozens of prompts to find the perfect phrasing. How do you ask an AI to change clothes but not a face? How do you describe a "cyberpunk" aesthetic? The studio gave me instant visual feedback.
Model Selection: I specifically chose gemini-2.5-flash-image-preview for its incredible balance of speed and quality in image manipulation tasks.
API Integration: Once the prompts were perfected, I integrated the @google/genai SDK into the app. The code directly calls the model with the user's image and the selected scenario's prompt, bringing the magic to life.
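
As a rough sketch of that integration step, the snippet below only builds the request payload. The part types are local stand-ins rather than imports from @google/genai, and the scenario prompts and the `buildPersonaRequest` helper are hypothetical, since the app's real prompt wording is not shown in this post. In the app, an object like this would presumably be handed to the SDK's content-generation call.

```typescript
// Local stand-ins for the image and text parts of a multimodal request.
type InlineImagePart = { inlineData: { mimeType: string; data: string } };
type TextPart = { text: string };
type Part = InlineImagePart | TextPart;

interface EditRequest {
  model: string;
  contents: { parts: Part[] };
}

// Hypothetical prompts keyed by scenario card title (illustrative wording only).
const SCENARIO_PROMPTS: Record<string, string> = {
  "Executive Drive":
    "Place this person in a luxury car, change their clothing to a business suit, and keep the facial features identical.",
  "Enchanted Forest":
    "Reimagine this person as an elf in ethereal robes, standing in a forest lit by glowing mushrooms, and keep the facial features identical.",
};

// Pairs the uploaded photo (base64, no data-URL prefix) with the scenario's prompt.
function buildPersonaRequest(
  base64Image: string,
  mimeType: string,
  scenario: string,
): EditRequest {
  const prompt = SCENARIO_PROMPTS[scenario];
  if (prompt === undefined) {
    throw new Error(`Unknown scenario: ${scenario}`);
  }
  return {
    model: "gemini-2.5-flash-image-preview",
    contents: {
      parts: [{ inlineData: { mimeType, data: base64Image } }, { text: prompt }],
    },
  };
}
```

Keeping the prompts in one lookup table means a new scenario card only needs a new entry, not new request code.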

Without the power and flexibility of the Gemini models, this applet would not have been possible.

Multimodal Features

Persona-Portraits AI is multimodal at its very core. It thrives on the conversation between different types of data.

Here’s the breakdown:

Input 1 (Image): The user uploads their photograph. This is the visual anchor, the subject of our story.
Input 2 (Text): The user selects a scenario, which corresponds to a carefully crafted prompt. This is the narrative instruction, the plot of our story.

The Gemini model doesn't just process these inputs one after the other. It understands them together.

It looks at your face in the photo and comprehends the instruction: "Place this person in a luxury car... change their clothing to a business suit... keep the facial features identical."

This fusion of image and text understanding is what creates a believable, high-quality result. It’s not a simple filter or a cut-and-paste job. It’s a contextual transformation.

This multimodal approach enhances the user experience by offering:

Limitless Creativity: Any prompt can become a new reality.
Deep Personalization: The final image is uniquely yours, not a generic template.
Simplicity: Users don't need to be prompt engineers. They just pick a vibe, and the app handles the complex conversation with the AI.

By combining what the user looks like with what they want to be, we create a truly magical and personal piece of art.

Try ChromaFlip Chronicles

Arunav Maitra — Sun, 14 Sep 2025 18:37:00 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built ChromaFlip Chronicles, a digital experience that breathes new life into the classic photo album.

Imagine a scrapbook, but one that's alive.
One that's interactive.
And one that's powered by your own imagination.

That's ChromaFlip Chronicles.

It's a beautifully designed, hand-drawn style notebook that you can flip through, page by page. But here's the magic: it's not just a gallery. It's a creative canvas. Each page allows you to take a photo—a memory, a piece of art, a random snapshot—and completely remix it using the power of generative AI.

It solves a simple but profound problem: our digital photos often sit stagnant in folders. This applet turns passive viewing into an active, creative process, allowing anyone to become a digital artist and storyteller.

It’s your AI-powered visual diary, where memories are not just stored, but wonderfully reimagined.

Demo

Here's a look at the enchanting world of ChromaFlip Chronicles.

A live demo link for you to try

Screenshots:

A glimpse of the main notebook interface, where users can navigate through their visual diary.

Here, you can see the intuitive controls for remixing an image with a simple text prompt.

How I Used Google AI Studio

Google AI Studio was the creative engine behind this project. The star of the show is the Gemini 2.5 Flash Image Preview model (also known as nano-banana).

My entire application is built around its unique multimodal capabilities.

Here’s the technical breakdown:

The Request: When a user wants to "remix" an image, I send a request to the Gemini API using the @google/genai library.
Multimodal Input: This isn't just a text prompt. The request is multimodal because it sends two distinct pieces of information together:
- The user's existing image (as a base64 encoded string).
- The user's creative text prompt (e.g., "make this black and white photo burst with color").
The Magic: The gemini-2.5-flash-image-preview model understands how to interpret the text prompt as a set of instructions to edit the provided image.
The Response: The model then sends back a brand new, AI-generated image, which my app seamlessly displays on the notebook page.
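
To make the request side concrete, here is a minimal sketch of turning an image into the base64 part the request needs. It assumes the app holds the picture as a data URL (for example from a browser `FileReader`), which this post does not confirm; `dataUrlToInlinePart` and `buildRemixParts` are hypothetical helper names.

```typescript
type InlineImagePart = { inlineData: { mimeType: string; data: string } };
type TextPart = { text: string };

// Splits a data URL such as "data:image/png;base64,iVBOR..." into the
// MIME type and the raw base64 payload expected in an inlineData part.
function dataUrlToInlinePart(dataUrl: string): InlineImagePart {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (match === null) {
    throw new Error("Expected a base64-encoded data URL");
  }
  const [, mimeType, data] = match;
  return { inlineData: { mimeType, data } };
}

// Image first, then the user's remix instruction as a text part.
function buildRemixParts(
  dataUrl: string,
  userPrompt: string,
): (InlineImagePart | TextPart)[] {
  return [dataUrlToInlinePart(dataUrl), { text: userPrompt.trim() }];
}
```

Stripping the `data:...;base64,` prefix matters because the inline part carries the MIME type and the payload as separate fields.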

It was surprisingly simple to implement, yet incredibly powerful in its results.

Multimodal Features

The core of ChromaFlip Chronicles is its multimodal functionality. It's not just a feature; it's the entire premise.

Why does this enhance the user experience?

It's Personal: Instead of generating images from scratch, users start with their own photos. This makes the creative process deeply personal and grounded in their own memories. You're not just creating art; you're transforming a piece of your own life.
It's Intuitive: The interaction is as simple as talking. You just tell the AI what you want to change about your picture. This removes the barrier of complex photo editing software and opens up creative expression to everyone.
It's a Creative Partnership: The multimodality—combining an image (what you have) with a text prompt (what you imagine)—creates a beautiful partnership between the user and the AI. It feels less like using a tool and more like collaborating with a creative partner who can instantly bring your ideas to life.

This fusion of image and text input is what makes ChromaFlip Chronicles a truly magical and engaging experience.

I built Element Fusion

Arunav Maitra — Sun, 14 Sep 2025 16:46:00 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

Ever had a wild, creative idea that was hard to put into words?

Maybe you imagined a cyberpunk cat, wearing your favorite sunglasses, majestically riding a cosmic whale through a nebula made of donuts.

Trying to generate that with text alone can be a challenge. The AI might not get the exact style of sunglasses or the specific look of the cat you envisioned.

That's the problem I wanted to solve.

So, I built Element Fusion.

Element Fusion isn't just another image generator. It's a visual alchemy engine. It's a creative playground where you provide the core ingredients.

Here’s the magic formula:

You upload the elements: That specific cat, those exact sunglasses, a picture of a whale. These are your non-negotiable visual assets.
You describe the scene: This is where you become the director. You write the prompt that ties everything together. "Create a photorealistic image of the cat wearing the sunglasses, riding the whale through a vibrant, swirling nebula..."
Element Fusion creates: The app uses the power of Gemini to intelligently understand and combine all your visual elements into one seamless, stunning, and often surprising new image.

It's an applet built for artists, designers, meme-makers, and anyone who wants to bring their most complex visual daydreams to life.

Demo

Check out the live applet here: Link to your deployed applet

Here’s a glimpse into the creative process:

Step 1: The Canvas Awaits

Our journey begins on a sleek, futuristic interface. The stage is set for your imagination to take flight.

A placeholder image showing the app's beautiful hero section.

Step 2: Assembling the Elements

Here, you upload your core visual components. For this masterpiece, we've chosen a noble cat, a futuristic city, and a classic car.

A placeholder image showing the file upload area populated with distinct images.

Step 3: Directing the Vision

With our elements in place, we write the prompt. This is our script for the AI, telling it precisely how to blend the images into a cohesive scene.

"A majestic cat driving the vintage car down the neon-lit main street of the futuristic city at night. The style should be cinematic and photorealistic."

Step 4: The Fusion!

We hit the "Fuse Elements" button and watch the magic happen. Gemini gets to work, weaving our separate images into a single narrative. The result? A stunning, one-of-a-kind creation that was impossible to describe with words alone.

A placeholder image of a breathtaking, AI-generated image that combines the uploaded elements.

How I Used Google AI Studio

Google AI Studio and the Gemini API are the heart and soul of this project.

My workflow was centered around the incredible capabilities of the gemini-2.5-flash-image-preview model, also known as the "nano-banana" model. This model is exceptionally good at understanding and manipulating image data.

Here's the technical breakdown:

Prototyping in AI Studio: Before writing a single line of code, I used Google AI Studio to test the core concept. I manually uploaded different combinations of images and wrote various text prompts to see how the model would respond. This was crucial for understanding its strengths and limitations, and for refining the prompt engineering strategy.
Multimodal Requests: The app's core function is sending a rich, multimodal request to the Gemini API using the @google/genai SDK.
- Each user-uploaded image is converted to a base64 string.
- These are then formatted as individual inlineData parts in the request payload.
- The user's written description is added as a final text part.

This means a single API call might contain multiple images and one text prompt—a truly multimodal instruction set.
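
The payload assembly described above can be sketched as follows. The `UploadedElement` shape and the `buildFusionParts` helper are hypothetical names for illustration, and the part types are local stand-ins rather than SDK imports.

```typescript
type InlineImagePart = { inlineData: { mimeType: string; data: string } };
type TextPart = { text: string };
type Part = InlineImagePart | TextPart;

// Hypothetical shape for an image the user has uploaded and encoded.
interface UploadedElement {
  mimeType: string;
  base64: string;
}

// One inlineData part per uploaded image, then the description as the final text part.
function buildFusionParts(elements: UploadedElement[], description: string): Part[] {
  if (elements.length === 0) {
    throw new Error("Upload at least one element image");
  }
  const imageParts: Part[] = elements.map((el) => ({
    inlineData: { mimeType: el.mimeType, data: el.base64 },
  }));
  return [...imageParts, { text: description }];
}
```

With three uploads this yields four parts: three images in upload order followed by one text instruction.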

Parsing the Response: The gemini-2.5-flash-image-preview model can return both a new image and a text description. My code is set up to parse the response, extract the new base64 image data to display it, and show any accompanying text from the model.
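
A minimal sketch of that parsing step, assuming the response exposes its output under `candidates[0].content.parts` with image bytes in `inlineData.data`. The interfaces below are a simplified local model of that shape, and `parseFusionResponse` is a hypothetical helper, not the app's actual code.

```typescript
// Simplified local model of the parts a response may contain.
interface ResponsePart {
  text?: string;
  inlineData?: { mimeType?: string; data?: string };
}

interface ModelResponse {
  candidates?: { content?: { parts?: ResponsePart[] } }[];
}

interface FusionResult {
  imageDataUrl: string | null; // ready to use as an <img> src, or null if no image came back
  caption: string; // any text the model returned alongside the image
}

function parseFusionResponse(response: ModelResponse): FusionResult {
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  let imageDataUrl: string | null = null;
  const texts: string[] = [];
  for (const part of parts) {
    if (part.inlineData?.data && imageDataUrl === null) {
      // Keep the first image only; fall back to PNG if no MIME type is given.
      const mime = part.inlineData.mimeType ?? "image/png";
      imageDataUrl = `data:${mime};base64,${part.inlineData.data}`;
    } else if (part.text) {
      texts.push(part.text);
    }
  }
  return { imageDataUrl, caption: texts.join(" ").trim() };
}
```

Returning `null` for a missing image lets the UI show an error state instead of a broken image when the model replies with text only.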

Multimodal Features

The multimodality here is deep and transformative for the creative process. This isn't just text-to-image; it's multi-image-and-text-to-image.

Why is this a game-changer?

Ultimate Specificity: It gives the user unprecedented control. Instead of vaguely describing "a cute dog," you can upload a picture of your dog. The AI then works with that specific visual information, preserving the unique character, breed, and even the lighting from your original photo.
Creative Cohesion: The text prompt acts as the narrative glue. It tells the model how to combine the provided visual elements. It sets the mood, the style, the action, and the environment. This synergy between the provided images (the what) and the text prompt (the how) allows for the creation of incredibly nuanced and personal images.
Enhanced User Experience: This approach transforms the user from a passive requester into an active co-creator. You are not just asking the AI to make something for you; you are collaborating with the AI, providing it with the key building blocks to assemble your vision. It feels less like a command and more like a creative partnership.

In short, Element Fusion leverages multimodality to create a powerful tool that respects the user's specific visual assets while using AI to weave them into something entirely new and magical.

Crystal Vision AI

Arunav Maitra — Sun, 14 Sep 2025 09:44:00 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built Crystal Vision AI.

My goal wasn't just to create another image generator.
I wanted to build an experience.

A magical portal where your ideas and photos are transformed into mystical works of art.

Crystal Vision AI solves a simple problem: How can we make AI art generation more personal and enchanting?

It does this in two powerful ways:

Enchant an Image: You can upload your own photo—of your pet, a friend, or a favorite object. Then, you provide a text prompt to magically edit it. The AI understands both the image and your words to create something entirely new.
Summon a Vision: For moments of pure imagination, you can simply describe a scene. The AI acts as your personal oracle, conjuring a stunning, photorealistic image from your words alone.

The core magic?

Every creation is beautifully and seamlessly encapsulated within a glowing, hyper-realistic crystal ball, turning every generation into a unique, mystical artifact.

It's a tool designed to spark joy, unleash creativity, and make you feel like a real magician.

Demo

Behold the magic in action!

🔮 Live Applet Link: Experience Crystal Vision AI Here!

Here's a glimpse into the visual journey:

The Grand Welcome

Users are greeted by an ethereal, animated interface that immediately sets a magical tone.

Enchanting a Personal Photo

Here, a user has uploaded a photo of their cat and is adding a prompt to give it a sparkling crown. Notice the simple, intuitive controls.

The Final Masterpiece

After a moment of 'consulting the oracle,' the final vision is revealed—a breathtaking image, perfectly rendered inside the crystal ball.

How I Used Google AI Studio

Google AI Studio was my digital alchemy lab. It was the crucial first step where I prototyped, tested, and truly understood the capabilities of the Gemini models before writing a single line of production code.

My two key ingredients were:

gemini-2.5-flash-image-preview: This was the absolute star of the show. Its powerful multimodal capabilities are the engine behind the "Enchant an Image" feature. I used the Studio to test how the model would interpret an uploaded image alongside a text prompt.
imagen-4.0-generate-001: This model is a pure powerhouse for text-to-image generation. It's the oracle that powers the "Summon a Vision" feature, creating stunningly detailed images from just a description.

My process involved countless iterations in the Studio to perfect the prompts. I fine-tuned phrases like "hyper-realistic, glowing crystal ball" and "sitting on a dark, mystical surface" to achieve the exact aesthetic I envisioned. This rapid prototyping saved hours of development time and ensured the final app produced consistently magical results.
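
One way to keep that aesthetic consistent is to wrap every user prompt in a fixed template before it reaches either model. The sketch below is an assumption about how this could be done: only the two quoted phrases come from the prototyping described above, while the template wording and the `buildCrystalPrompt` helper are hypothetical.

```typescript
type Mode = "enchant" | "summon";

// Each mode is served by a different model, as listed above.
const MODEL_FOR_MODE: Record<Mode, string> = {
  enchant: "gemini-2.5-flash-image-preview",
  summon: "imagen-4.0-generate-001",
};

// Hypothetical suffix built from the phrases tuned in AI Studio.
const CRYSTAL_BALL_SUFFIX =
  "rendered inside a hyper-realistic, glowing crystal ball sitting on a dark, mystical surface";

function buildCrystalPrompt(
  mode: Mode,
  userPrompt: string,
): { model: string; prompt: string } {
  const idea = userPrompt.trim();
  if (idea.length === 0) {
    throw new Error("Prompt must not be empty");
  }
  const lead =
    mode === "enchant"
      ? `Edit the provided image: ${idea}.`
      : `Create an image of ${idea}.`;
  return {
    model: MODEL_FOR_MODE[mode],
    prompt: `${lead} The result is ${CRYSTAL_BALL_SUFFIX}.`,
  };
}
```

Because the suffix lives in one constant, a wording tweak found during prototyping applies to both modes at once.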

Multimodal Features

The soul of Crystal Vision AI lies in its multimodal functionality.

Specifically, in the Enchant an Image mode.

This isn't just a simple image filter. It's a true creative conversation with the AI. The model processes two distinct types of information simultaneously:

Visual Input: The user's uploaded image. The AI doesn't just see pixels; it gains a contextual understanding of the subject and composition of the photo.
Textual Input: The user's typed command. This is where the user directs the magic, asking for specific changes like "add a wizard hat" or "make it look like it's made of stars."

The model then fuses these two inputs. It intelligently identifies the main subject from the image and applies the textual command to it, before reimagining the entire scene within the crystal ball theme.

Why does this enhance the user experience?

It makes the creation process deeply personal and interactive.

Users aren't just passive prompters; they are active collaborators with the AI. They can bring their own life and memories—their pets, their friends, their art—into the magical world.

This transforms the app from a simple generator into a powerful, personal creative companion. It's the profound difference between asking an AI to create a dragon, and asking it to give your beloved pet lizard a pair of majestic, fiery wings.

That is the magic of multimodality.
And that is the magic of Crystal Vision AI.

ArchiBlocks 3D

Arunav Maitra — Sat, 13 Sep 2025 08:55:37 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built ArchiBlocks 3D, a web application that magically transforms real-world architectural photos into captivating 3D block models.

Have you ever looked at a building and imagined what it would look like as a LEGO set, a clay model, or something straight out of a low-poly video game?

That's the core idea behind ArchiBlocks 3D. It bridges the gap between reality and imagination.

It's a simple, intuitive tool for:

Artists seeking inspiration.
Designers looking for a new way to visualize concepts.
Hobbyists who just want to have fun and see the world differently.

The app takes a user's uploaded image and a text prompt describing a desired style, and uses the power of Gemini to generate a brand new image that renders the scene as a stylized 3D model.

Demo: Live Here or play with it

Here’s a walkthrough of the experience. Imagine uploading a photo of the iconic Eiffel Tower...

First, the user is greeted by a dynamic hero section with an animated background, setting a creative tone.

Next, they scroll down to the generator. Here, they can drag-and-drop their architectural photo and type in a creative prompt. For this example, the prompt is: “Isometric low-poly 3D model, vibrant colors.”

After hitting the ✨ Generate 3D Model button, the magic happens! The AI gets to work, and in a few moments, the result is displayed in a beautiful side-by-side comparison.

Please note: This applet was built using the gemini-2.5-flash-image-preview model. The screenshots and descriptions here showcase its full functionality in action!

How I Used Google AI Studio

Google AI Studio was the engine behind this entire project. I specifically leveraged the Gemini API and its powerful multimodal capabilities.

My model of choice was gemini-2.5-flash-image-preview (also known as nano-banana), which is absolutely perfect for this kind of creative image editing task.

The implementation is centered in my services/geminiService.ts file. In it, I construct a generateContent request that includes:

An image part, containing the user's uploaded photo as a base64 encoded string.
A text part, which combines my instructions with the user's unique style prompt.
A config object where I specify that the response should include both Modality.IMAGE and Modality.TEXT.

This setup allows Gemini to understand the visual context from the image and the stylistic direction from the text, merging them to produce a completely new piece of art.
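
A hedged sketch of what such a request object could look like is below. It is not the contents of services/geminiService.ts: the `Modality` constant is a local stand-in for the SDK's enum, and `BASE_INSTRUCTION` and `buildStyleRequest` are hypothetical names with illustrative wording.

```typescript
// Local stand-in for the SDK's Modality values.
const Modality = { TEXT: "TEXT", IMAGE: "IMAGE" } as const;
type Modality = (typeof Modality)[keyof typeof Modality];

type Part =
  | { inlineData: { mimeType: string; data: string } }
  | { text: string };

interface StyleRequest {
  model: string;
  contents: { parts: Part[] };
  config: { responseModalities: Modality[] };
}

// Hypothetical fixed instruction that is combined with the user's style prompt.
const BASE_INSTRUCTION =
  "Recreate the architecture in this photo as a 3D block model.";

function buildStyleRequest(
  base64Photo: string,
  mimeType: string,
  stylePrompt: string,
): StyleRequest {
  return {
    model: "gemini-2.5-flash-image-preview",
    contents: {
      parts: [
        { inlineData: { mimeType, data: base64Photo } },
        { text: `${BASE_INSTRUCTION} Style: ${stylePrompt.trim()}` },
      ],
    },
    // Ask for an image back, with text allowed alongside it.
    config: { responseModalities: [Modality.IMAGE, Modality.TEXT] },
  };
}
```

Keeping the fixed instruction separate from the user's style text makes it easy to adjust one without touching the other.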

Multimodal Features

The core of ArchiBlocks 3D is its Image-and-Text-to-Image generation. This is a truly multimodal feature that enhances the user experience in a profound way.

Input 1 (Image): The user provides the visual foundation—a photo of a building or landscape. This sets the scene and defines the subject.
Input 2 (Text): The user provides the creative direction—a prompt like "a cute claymation model" or "a futuristic neon render."
Output (Image): The AI synthesizes both inputs to generate a new image that respects the structure of the original photo but completely reimagines it in the user's desired style.

Why is this so powerful?

Because it gives the user agency and control. Instead of a one-size-fits-all filter, it opens up an infinite canvas of possibilities. The user isn't just a passive observer; they are an active collaborator with the AI, co-creating a unique visual masterpiece.

This direct, creative dialogue between the user, their photo, and the AI is what makes ArchiBlocks 3D so engaging and fun.

BrickVerse AI

Arunav Maitra — Sat, 13 Sep 2025 08:36:27 +0000

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built BrickVerse AI.

It's a magical portal where imagination meets digital creation.

This applet solves a simple, yet wonderful problem: How do you visualize any city in the world as a vibrant, intricate LEGO masterpiece?

BrickVerse AI creates a delightful experience by allowing anyone, regardless of artistic skill, to become a master LEGO builder.

You can start with just a simple city name.
Type "Paris"... and watch the Eiffel Tower rise, brick by brick.

Or, you can upload a personal photo.
A snapshot from your last vacation... and see it completely reimagined as a bustling LEGO world.

The goal is to spark creativity and bring a sense of childlike wonder to digital art, powered by the incredible capabilities of generative AI.

Demo

You can experience the magic live right here:
Link to Deployed Applet

Here’s a little sneak peek into the world of BrickVerse AI!

1. The Sleek & Simple Interface: Choose your creative path - text or image.

2. Generating from Text: We typed "Tokyo" and the AI started building...

3. The Final Masterpiece: A stunning, photorealistic LEGO Tokyo, complete with cherry blossoms and iconic towers.

How I Used Google AI Studio

Google AI Studio was my digital workshop for this project. It was the perfect environment to explore, prototype, and harness the power of Google's latest multimodal models.

I primarily leveraged two phenomenal models:

imagen-4.0-generate-001: For the text-to-image generation. I used AI Studio to fine-tune my prompts, experimenting with different keywords like "photorealistic", "cinematic lighting", and "bustling with LEGO pedestrians" to achieve that perfect, lively LEGO aesthetic. The ability to quickly iterate in the studio was invaluable.
gemini-2.5-flash-image-preview: This model is the heart of the image-to-image feature. My entire prompt engineering for transforming an existing photo into a LEGO world was done within AI Studio. I crafted instructions that guided the model to recreate, not just overlay, the source image, ensuring every building, tree, and car was reimagined in brick form.

AI Studio made the process of integrating these powerful AI capabilities seamless and intuitive.

Multimodal Features

The true essence of BrickVerse AI lies in its multimodal nature. It's not just about one input; it's about offering creative flexibility.

1. From Words to Worlds (Text-to-Image)

This feature allows users to conjure a world from pure imagination.

How it works: A user provides a text string (e.g., "New York City"). The application then embeds this into a more detailed prompt and sends it to the imagen-4.0-generate-001 model.
Why it enhances the experience: It's the ultimate creative sandbox. You don't need a reference; you just need an idea. It makes the creation process incredibly accessible and limitless. You can dream of a LEGO Venice during a flood or a futuristic LEGO Dubai, and the AI will build it for you.
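
The prompt-embedding step might look something like this sketch. The template wording is hypothetical (it reuses keywords mentioned earlier in this post), `buildLegoCityParams` is an illustrative helper, and the parameter object assumes the general shape of an image-generation call with a model, a prompt, and an image count.

```typescript
interface GenerateImagesParams {
  model: string;
  prompt: string;
  config: { numberOfImages: number };
}

// Wraps a bare city name in a fuller, style-setting prompt.
function buildLegoCityParams(city: string): GenerateImagesParams {
  const place = city.trim();
  if (place.length === 0) {
    throw new Error("City name must not be empty");
  }
  const prompt =
    `A photorealistic LEGO model of ${place}, cinematic lighting, ` +
    "bustling with LEGO pedestrians";
  return {
    model: "imagen-4.0-generate-001",
    prompt,
    config: { numberOfImages: 1 },
  };
}
```

The user only ever types the city name; the style keywords stay inside the template.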

2. From Pixels to Plastic (Image-to-Image)

This feature makes the creation process deeply personal.

How it works: A user uploads an image. The app sends this image data along with a text prompt (e.g., "Transform this entire image into a vibrant, detailed LEGO city scene") to the gemini-2.5-flash-image-preview model.
Why it enhances the experience: This is where the magic becomes personal. Users can upload photos of their own hometown, a favorite landmark, or a cherished vacation spot. The AI doesn't just add a filter; it understands the context of the image and rebuilds it. Seeing a personal memory transformed into a work of LEGO art creates a powerful and engaging emotional connection for the user.
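
Since the two features go to different models, the app needs some way to route each input to the right one. The sketch below shows one possible approach using a discriminated union; `BrickInput`, `BrickJob`, and `planBrickJob` are hypothetical names, and the text-path prompt is a deliberately simple placeholder.

```typescript
type BrickInput =
  | { kind: "text"; city: string }
  | { kind: "image"; base64: string; mimeType: string };

type BrickJob =
  | { api: "generateImages"; model: string; prompt: string }
  | {
      api: "generateContent";
      model: string;
      parts: (
        | { inlineData: { mimeType: string; data: string } }
        | { text: string }
      )[];
    };

// Chooses the model and payload from the kind of input the user supplied.
function planBrickJob(input: BrickInput): BrickJob {
  switch (input.kind) {
    case "text":
      return {
        api: "generateImages",
        model: "imagen-4.0-generate-001",
        prompt: `${input.city.trim()} as a detailed LEGO city`,
      };
    case "image":
      return {
        api: "generateContent",
        model: "gemini-2.5-flash-image-preview",
        parts: [
          { inlineData: { mimeType: input.mimeType, data: input.base64 } },
          { text: "Transform this entire image into a vibrant, detailed LEGO city scene" },
        ],
      };
  }
}
```

The `kind` field lets the compiler check that each path has been handled, which helps if a third input type is added later.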

By combining both text and image inputs, BrickVerse AI caters to different creative impulses, making it a truly versatile and captivating multimodal application.